Abstract


Many classical algorithms are found, years after their conception, to outlive the confines in which
they were conceived and to remain relevant in unforeseen settings. In this paper, we show
that SVRG is one such method: although originally designed for strongly convex objectives, it is
also very robust in non-strongly convex or sum-of-non-convex settings.

More precisely, we provide new analysis that improves the state-of-the-art running times in
both settings, by applying either SVRG or a novel variant of it. Since non-strongly convex
objectives include important examples such as Lasso or logistic regression, and sum-of-non-convex
objectives include famous examples such as stochastic PCA and are even believed to be related
to training deep neural nets, our results also imply better performance in these applications.

1 Introduction 

The fundamental algorithmic problem in optimization is to design efficient algorithms for solving
certain classes of problems. By distinguishing between smooth and non-smooth functions, between
weakly convex and strongly convex functions, between proximal and non-proximal functions, or
even between convex and non-convex functions, the number of classes grows exponentially, and it
may be unrealistic to design a new algorithm for each specific class. Taking into account such
“design complexity”, it is beneficial to design a single method that works for multiple classes, and
perhaps even more beneficial if this method is already widely used and happens to outlive the
confines it was originally designed for. Although this is easier done in practice than in theory,
providing a supporting theory that unifies the underlying classes for a specific method is
particularly exciting, challenging, and sometimes even enlightening: the theoretical findings may
further guide experimentalists on how such a method should be best tuned in practice.

In this paper, we revisit the SVRG method by Johnson and Zhang [13] and explore its applications
to either a non-strongly convex objective, or a sum-of-non-convex objective, or even both. We
show faster convergence results for minimizing such objectives by either directly applying SVRG
or modifying it in a novel manner.

Consider the following composite convex minimization: 



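$$\min_{x \in \mathbb{R}^d} \Big\{ F(x) := f(x) + \Psi(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \Psi(x) \Big\} . \qquad (1.1)$$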


*The current version polishes the writing and adds more experiments. 
Here, $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ is a convex function that is written as a finite average of $n$ smooth
functions $f_i(x)$,$^1$ and $\Psi(x)$ is a relatively simple (but possibly non-differentiable) convex function,
sometimes referred to as the proximal function. Suppose we are interested in finding an approximate
minimizer $x \in \mathbb{R}^d$ satisfying $F(x) \le F(x^*) + \varepsilon$, where $x^*$ is a minimizer of $F(x)$.

Examples. Problems of this form arise in many places in machine learning, statistics, and
operations research. For instance, many regularized empirical risk minimization (ERM) problems
fall into this category with convex $f_i(\cdot)$. In such problems, we are given $n$ training examples
$\{(a_1, \ell_1), \dots, (a_n, \ell_n)\}$, where each $a_i \in \mathbb{R}^d$ is the feature vector of example $i$, and each
$\ell_i \in \mathbb{R}$ is the label of example $i$. The following classification and regression problems are
well-known examples of ERM:


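• Ridge Regression: $f_i(x) = \frac{1}{2}(\langle a_i, x\rangle - \ell_i)^2 + \frac{\sigma}{2}\|x\|_2^2$ and $\Psi(x) = 0$.

• Lasso: $f_i(x) = \frac{1}{2}(\langle a_i, x\rangle - \ell_i)^2$ and $\Psi(x) = \sigma\|x\|_1$.

• $\ell_1$-Regularized Logistic Regression: $f_i(x) = \log(1 + \exp(-\ell_i\langle a_i, x\rangle))$ and $\Psi(x) = \sigma\|x\|_1$.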


Another important problem that falls into this category is the principal component analysis
(PCA) problem. Suppose we are given $n$ data vectors $a_1, \dots, a_n \in \mathbb{R}^d$, and denote by
$A = \frac{1}{n}\sum_{i=1}^{n} a_i a_i^T$ the normalized covariance matrix. Garber and Hazan [8] showed that
approximately finding the principal component of $A$ is equivalent to minimizing
$f(x) = \frac{1}{2}x^T(\mu I - A)x$ for some suitably chosen parameter $\mu > 0$. Therefore, defining
$f_i(x) = \frac{1}{2}x^T(\mu I - a_i a_i^T)x$ and $\Psi(x) = 0$, this falls into Problem (1.1) with non-convex
functions $f_i(\cdot)$.


Background of SVRG. Stochastic first-order methods perform the following updates to solve
Problem (1.1):

$$x_{t+1} \leftarrow \arg\min_{y \in \mathbb{R}^d} \Big\{ \frac{1}{2\eta}\|y - x_t\|_2^2 + \langle \xi_t, y\rangle + \Psi(y) \Big\} ,$$

where $\eta$ is the step length, and $\xi_t$ is a random vector satisfying $\mathbb{E}[\xi_t] = \nabla f(x_t)$, which is
referred to as the stochastic gradient. If the proximal function $\Psi(y)$ equals zero, the update simply
reduces to $x_{t+1} \leftarrow x_t - \eta\xi_t$.

Given the “finite average” structure $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, a classical choice is to set
$\xi_t = \nabla f_i(x_t)$ for some random index $i \in [n]$ per iteration. Methods based on this choice are
known as stochastic gradient descent (SGD).
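As a minimal sketch of one such proximal stochastic step, assume the Lasso instance from the examples, $f_i(x) = \frac{1}{2}(\langle a_i, x\rangle - \ell_i)^2$ and $\Psi(x) = \sigma\|x\|_1$. Under that assumption the arg min has a closed form: soft-thresholding of $x_t - \eta\xi_t$ at level $\eta\sigma$. The function names below are illustrative and not part of the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1: shrink every coordinate toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_sgd_step(x, a_i, label_i, eta, sigma):
    """One proximal SGD step for Lasso with the single-example gradient xi = grad f_i(x)."""
    xi = a_i * (a_i @ x - label_i)          # gradient of 0.5 * (<a_i, x> - label_i)^2
    # argmin_y { ||y - x||^2 / (2 eta) + <xi, y> + sigma ||y||_1 }
    return soft_threshold(x - eta * xi, eta * sigma)
```

When $\Psi \equiv 0$ (that is, `sigma = 0`), the same function reduces to the plain update $x_{t+1} \leftarrow x_t - \eta\xi_t$.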

More recently, the convergence speed of SGD has been further improved with the variance-reduction
technique [5, 13, 18, 21, 24, 25, 29]. In all of these cited results, the authors have, in
one way or another, shown that SGD can converge much faster if one makes a better choice of the
stochastic gradient $\xi_t$, so that its variance $\mathbb{E}[\|\xi_t - \nabla f(x_t)\|_2^2]$ reduces as $t$ increases.

One particular way to reduce the variance is the SVRG method, described as follows [13]. Keep
a snapshot $\tilde{x} = x_t$ after every $m$ stochastic update steps (where $m$ is some parameter), and compute
the full gradient $\nabla f(\tilde{x})$ only for such snapshots. Then, set $\xi_t = \nabla f_i(x_t) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$ as the
stochastic gradient. One can verify that, under this choice of $\xi_t$, it satisfies $\mathbb{E}[\xi_t] = \nabla f(x_t)$ and
$\lim_{t\to\infty} \mathbb{E}[\|\xi_t - \nabla f(x_t)\|_2^2] = 0$.
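The unbiasedness of this estimator can be checked numerically. The sketch below assumes least-squares components $f_i(x) = \frac{1}{2}(\langle a_i, x\rangle - b_i)^2$ (an illustrative choice, with hypothetical names) and forms the estimator $\nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$ for every index $i$ at once, so that the expectation over a uniform $i$ is simply a row average.

```python
import numpy as np

def svrg_estimators(A, b, x, x_snap):
    """Return (xis, full_grad): row i of xis is the SVRG estimator built from index i,
    and full_grad is the exact gradient of f at x, for f_i(x) = 0.5 * (<a_i, x> - b_i)^2."""
    grads_x = A * (A @ x - b)[:, None]            # row i is grad f_i(x)
    grads_snap = A * (A @ x_snap - b)[:, None]    # row i is grad f_i(x_snap)
    full_grad_snap = grads_snap.mean(axis=0)      # full gradient at the snapshot
    xis = grads_x - grads_snap + full_grad_snap
    return xis, grads_x.mean(axis=0)
```

Averaging the rows of `xis` recovers the full gradient at `x`, and when `x` coincides with the snapshot every row equals the full gradient, so the variance vanishes there.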

Non-Strongly Convex Objectives. Although many variance-reduction based methods have
been proposed, most of them, including SVRG, have convergence guarantees for Problem (1.1)
only when the objective $F(x)$ is strongly convex. However, in many machine learning applications, $F(x)$

1 In fact, even if each fi(x) is not smooth but only Lipschitz continuous, standard smoothing techniques such as 

Chapter 2.3 of [11] can make each fi(x) smooth without sacrificing too much accuracy. 
is simply not strongly convex. This is particularly true for Lasso [28] and $\ell_1$-Regularized Logistic
Regression [20], two cornerstone problems extensively used for feature selection.

One way to get around this is to add a dummy regularizer $\frac{\lambda}{2}\|x\|^2$ to $F(x)$, and then apply any
of the above methods. However, the weight of this regularizer, $\lambda$, needs to be chosen before the
algorithm starts. This adds a lot of difficulty when applying such methods in real life: (1) one
needs to tune $\lambda$ by repeatedly executing the algorithm, and (2) the error of the algorithm does
not converge to zero as time goes on (in fact, it converges to $O(\lambda)$, so one needs to know the desired
accuracy before the algorithm starts). Perhaps more importantly, adding the dummy regularizer
hurts the performance of the algorithm both in theory and in practice.

Another possible solution is to tackle the non-strongly convex case directly [5, 18, 21], without
using any dummy regularizer. These methods are the so-called anytime algorithms: they can be 
interrupted at any time, and the training error tends to zero as the number of iterations increases. 

While direct methods are much more convenient for practical uses, existing direct methods are
much slower than indirect methods (i.e., methods via dummy regularization), at least in theory.
More specifically, if the desired accuracy is $\varepsilon$ and the smoothness of each $f_i(x)$ is $L$, then the
gradient complexities$^2$ of the best known direct and indirect methods are respectively

$$O\Big(\frac{n + L}{\varepsilon}\Big) \quad\text{and}\quad O\Big(\big(n + \frac{L}{\varepsilon}\big)\log\frac{1}{\varepsilon}\Big) .$$

Therefore in theory, when $n$ is usually dominating, indirect methods are faster but less convenient,
while direct methods are slower but more convenient.

In this paper, we propose SVRG++, a new method that solves the non-strongly convex case of
Problem (1.1) directly with gradient complexity $O(n\log\frac{1}{\varepsilon} + \frac{L}{\varepsilon})$, outperforming both known direct
and indirect methods. In particular, our complexity outperforms known direct methods (e.g., SAGA
or SAG) by a factor $\Omega(n/L)$ in the case when $L < n$. Since $L$ is usually on the order of $O(s)$ for
large-scale machine learning problems, where $s$ is the sparsity of feature vectors and $s$ can be much
smaller than $n$, we claim that this outperformance may be significant in theory. On the practical
side, SVRG++ is a direct, anytime method, which is convenient to use. We describe SVRG++ and the
main techniques we use in Section 4.

Sum-of-Non-Convex Objectives. If $f(x)$ is $\sigma$-strongly convex while each $f_i(x)$ is non-convex
but $L$-smooth, Shalev-Shwartz discovered that the SVRG method admits a gradient complexity
of $O\big((n + \frac{L^2}{\sigma^2})\log\frac{1}{\varepsilon}\big)$ for minimizing $F(x)$ [22] in the case of $\Psi(x) = 0$. A similar result has been
independently rediscovered by Garber and Hazan [8] and applied to the PCA problem. This setting
is also believed to arise (at least locally) when training deep neural nets [3, 13, 22].

Despite the missing proximal term 'P(x) in their analysis, the running time above is imperfect 
for two reasons. 

• First, this complexity is not stable: even if we modify only one of the $f_i(x)$ from convex to (a
little bit) non-convex, the best known gradient complexity for SVRG immediately worsens
to $O\big((n + \frac{L^2}{\sigma^2})\log\frac{1}{\varepsilon}\big)$ from $O\big((n + \frac{L}{\sigma})\log\frac{1}{\varepsilon}\big)$. In contrast, one should expect a more graceful
decay of the performance as a function of the “magnitude” of the non-convexity, or perhaps
even a threshold where the performance is totally unaffected if the magnitude is “below” this
threshold.

• Second, the complexity does not take into account the asymmetry in smoothness. For instance,
in PCA applications, each $f_i(x)$ can be very non-convex and its Hessian has eigenvalues

2 Throughout this paper, we will use gradient complexity as an effective measure of an algorithm’s running time.
Usually, the total running time of an algorithm is $O(d)$ multiplied by its gradient complexity, because each $\nabla f_i(x)$
can be computed in $O(d)$ time.
between $-l < 0$ and $L > 0$ where $l$ can be significantly larger than $L$. Can we take advantage
of this asymmetry to get better running time?

In this paper, we prove that if each $f_i(x)$ is $L$-upper smooth and $l$-lower smooth (which means
the Hessian of $f_i(x)$ has eigenvalues bounded between $-l$ and $L$), then the same SVRG method admits
a gradient complexity of $O\big((n + \frac{L}{\sigma} + \frac{Ll}{\sigma^2})\log\frac{1}{\varepsilon}\big)$. This resolves both our aforementioned concerns.
First, if $l = O(\sigma)$, our new result suggests that the convergence of SVRG is asymptotically the
same as the convex case, meaning there is a threshold $O(\sigma)$ below which SVRG allows each $f_i(x)$
to be non-convex for free. Second, in the $l > L$ case, our result implies a linear
dependence on the non-convexity parameter $l$, rather than the quadratic one $O\big((n + \frac{l^2}{\sigma^2})\log\frac{1}{\varepsilon}\big)$
shown by prior work [8, 22]. To the best of our knowledge, this is the first time that upper and
lower smoothness parameters are distinguished in order to prove convergence results for minimizing
(1.1).

Our improvement on SVRG immediately leads to faster stochastic algorithms for PCA [8, 27].
Assume that $A = \frac{1}{n}\sum_{i=1}^{n} a_i a_i^T$ is a normalized covariance matrix where each $a_i \in \mathbb{R}^d$ has Euclidean
norm at most 1. Let $\lambda \in [0, 1]$ be the largest eigenvalue of $A$. Garber and Hazan showed that
computing the leading eigenvector of $A$ is, up to binary search preprocessing, equivalent to the
sum-of-non-convex form of Problem (1.1), with upper smoothness $L = \lambda$ and lower smoothness
$l = 1$.$^3$ Garber and Hazan further applied SVRG to minimize this objective and proved an overall
running time $O\big((nd + \frac{d}{\delta^2})\log\frac{1}{\varepsilon}\big)$. Our result improves this running time to $O\big((nd + \frac{\lambda d}{\delta^2})\log\frac{1}{\varepsilon}\big)$. Since
$\lambda$ may be as small as $1/d$, this speed-up is significant in theory.$^4$

Since the original publication of this paper, our above PCA speed-up has also been translated
to $k$-SVD, which is to compute the first $k$ singular vectors of a given matrix [4].

Our results above are non-accelerated for the sum-of-non-convex setting. One can apply Catalyst
[7, 15] to further improve the running time when $\sigma$ is very small. Not surprisingly, our
performance improvement carries over to the accelerated setting as well.

Finally, we also prove that our proposed improvements on SVRG (for non-strongly convex
objectives and for sum-of-non-convex objectives) can be put together, leading to a new algorithm
SVRG$^{++}_{\mathrm{nc}}$ that works for objectives that are both non-strongly convex and sum-of-non-convex.
This also gives faster algorithms than known results.

Roadmap. We discuss related work in Section 2 and provide notational background in Section 3.
We state our result for non-strongly convex objectives in Section 4, for sum-of-non-convex objectives
in Sections 5 and 6, and for both non-strongly convex and sum-of-non-convex objectives in Section 7.
In Section 8 and Section 9 we perform experiments supporting our theory. Most of the technical
proofs are included in the appendix.

2 Other Related Work 

The first published variance-reduction method is SAG [21]. SAG obtains an $O(\log(1/\varepsilon))$ convergence
(i.e., linear convergence) for strongly convex and smooth objectives, compared to the $O(1/\varepsilon)$
3 Suppose that the eigengap between the largest and second largest eigenvalues of $A$ is $\delta = \lambda - \lambda_2$. Garber and Hazan
showed that computing the principal component of $A$ is, up to binary search preprocessing, equivalent to minimizing
the objective $f(x) := \frac{1}{2}x^T(\mu I - A)x + b^T x$ where $\mu = \lambda + \delta$. If one defines $f_i(x) := \frac{1}{2}x^T(\mu I - a_i a_i^T)x + b^T x$, this
minimization problem falls into the sum-of-non-convex setting of Problem (1.1), with upper smoothness $L = \mu \approx \lambda$
and lower smoothness $l = 1$.

4 Garber and Hazan also applied acceleration schemes on top of SVRG, and obtained a running time 0( n „ d ). 
We can do the same thing here and improve their running time to 0( n 1 d ) in the accelerated setting. 
rate of SGD [12, 23]. This $O(\log(1/\varepsilon))$ rate has also been obtained by several concurrent or subsequent
works, such as SVRG, MISO and SAGA [5, 13, 18]. SDCA [25] has also been discovered to
be intrinsically performing some “variance reduction” procedure [22].

Among the variance-reduction algorithms, only SAG, MISO, and SAGA can provide theoretical
guarantees for directly solving non-strongly convex objectives (i.e., without adding a dummy
regularizer). The best gradient complexity for direct methods before our work is $O\big(\frac{n + L}{\varepsilon}\big)$, due to
SAG and SAGA. On the other hand, if one uses indirect methods, the best gradient complexity is
$O\big((n + \frac{L}{\varepsilon})\log\frac{1}{\varepsilon}\big)$, where the asymptotic dependence on $\varepsilon$ is weakened to $\frac{\log(1/\varepsilon)}{\varepsilon}$.

We work directly with smooth functions $f_i(x)$ rather than the more structured $f_i(x) = \phi_i(\langle x, a_i\rangle)$.
In the structured case, AccSDCA [26], along with subsequent works [16, 31], obtains a slightly better
gradient complexity $O\big((n + \min\{L/\varepsilon, \sqrt{nL/\varepsilon}\})\log\frac{1}{\varepsilon}\big)$ for non-strongly convex objectives. This
class of methods requires one to work with the dual of the objective, requires one to add a dummy
regularizer for non-strongly convex objectives (i.e., these methods are indirect), and runs faster than
the variance-reduction based methods only when $n < \sqrt{L/\varepsilon}$.

Since the original submission of this paper, we learned of several other related works from the
anonymous reviewers. First, the SVRG method was also independently discovered and published
by [30]. Second, the result of [17] also uses a doubling-epoch technique and can partially imply our
results on SVRG++, with a slightly more complicated proof and a different algorithm.$^5$ Third, in a
concurrent paper accepted to this ICML, Garber et al. [9] improved the original Garber-Hazan
PCA result [8] and thus solved a special case of our Theorem 6.1; their result has nothing to do
with the other theorems in this paper, especially Theorems 5.1 and 7.1.$^6$

In concurrent work, the authors of [2] obtained the same running time as SVRG++ through
reductions. However, their algorithm is not a direct one, so it cannot be as good as SVRG++ in practice.
Also, after this paper was accepted, the author of [1] provided a direct method for solving (1.1)
at an accelerated speed. As mentioned in [1], his method can be combined with the technique in
this paper to obtain a non-strongly convex accelerated running time.


3 Notations 


Throughout this paper, we denote by $\|\cdot\|$ the Euclidean norm. We assume that each $f_i(\cdot)$ is
differentiable and $\Psi(\cdot)$ is convex and lower semicontinuous.

We say that a differentiable function $f_i(\cdot)$ is $L$-smooth (or has $L$-Lipschitz continuous gradient)
if:

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\| \qquad \forall x, y \in \mathbb{R}^d .$$

The above definition has several equivalent forms, and one of them says that for all $x, y \in \mathbb{R}^d$,

$$-\frac{L}{2}\|y - x\|^2 \le f_i(y) - \big(f_i(x) + \langle\nabla f_i(x), y - x\rangle\big) \le \frac{L}{2}\|y - x\|^2 .$$

In this paper, we say $f_i(\cdot)$ is $L$-upper smooth if it satisfies

$$f_i(y) - \big(f_i(x) + \langle\nabla f_i(x), y - x\rangle\big) \le \frac{L}{2}\|y - x\|^2 \qquad \forall x, y \in \mathbb{R}^d ,$$

5 Mahdavi et al. studied an oracle model where there are two gradient oracles, a stochastic one and a full-gradient 
one. Then, they prove comparable bounds to SVRG ++ but without supporting proximal terms and therefore do not 
directly apply to ERM problems such as Lasso or logistic regression. 

6 For the PCA problem, they produced the same $O\big((nd + \frac{\lambda d}{\delta^2})\log\frac{1}{\varepsilon}\big)$ running time as we do; however, their result is
only about PCA and so does not solve general sum-of-non-convex objectives; they also did not introduce upper or lower
smoothness as we do.
and $f_i(\cdot)$ is $l$-lower smooth if it satisfies

$$f_i(y) - \big(f_i(x) + \langle\nabla f_i(x), y - x\rangle\big) \ge -\frac{l}{2}\|y - x\|^2 \qquad \forall x, y \in \mathbb{R}^d .$$

Let us give a few examples: a convex differentiable function is 0-lower smooth; an L-smooth function 
is L-upper and L-lower smooth; a convex L-smooth function is L-upper and 0-lower smooth. 

We say a function $f(\cdot)$ is $\sigma$-strongly convex if

$$f(y) - \big(f(x) + \langle\nabla f(x), y - x\rangle\big) \ge \frac{\sigma}{2}\|y - x\|^2 \qquad \forall x, y \in \mathbb{R}^d .$$

Note that for a twice differentiable function $f$, the above definitions are equivalent to the
corresponding statements about the eigenvalues of $\nabla^2 f(x)$. Indeed, $L$-upper smoothness is equivalent to
saying all eigenvalues are no more than $L$, $l$-lower smoothness is equivalent to saying all eigenvalues
are no less than $-l$, and $\sigma$-strong convexity is saying all eigenvalues are at least $\sigma$.
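The eigenvalue characterization can be made concrete for a quadratic, whose Hessian is constant. The sketch below is illustrative only: the helper name is hypothetical, and it reports the smallest non-negative parameters for which a symmetric Hessian `H` is upper and lower smooth.

```python
import numpy as np

def smoothness_from_hessian(H):
    """Smallest non-negative (L, l) such that every eigenvalue of the symmetric
    matrix H lies in [-l, L], i.e. L-upper and l-lower smoothness of a quadratic."""
    eigenvalues = np.linalg.eigvalsh(H)       # returned in ascending order
    L_upper = max(eigenvalues[-1], 0.0)
    l_lower = max(-eigenvalues[0], 0.0)
    return L_upper, l_lower
```

For instance, a PCA-style component with Hessian $\mu I - a a^T$, $\|a\| = 1$ and $\mu = 0.3$, has eigenvalues $\mu - 1$ and $\mu$, so the helper returns $L = 0.3$ and $l = 0.7$: the lower smoothness exceeds the upper one, which is the asymmetric regime discussed in the introduction.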


4 SVRG++ for Non-Strongly Convex Objectives


In this section we consider the case of Problem (1.1) when each $f_i(x)$ is a convex function and the
objective is not necessarily strongly convex. Recall that this class of problems includes Lasso and
logistic regression as notable examples.

We propose our SVRG++ algorithm for solving this case; see Algorithm 1. Given an initial vector
$x^\phi$, our algorithm is divided into $S$ epochs. The $s$-th epoch consists of $m_s$ stochastic gradient steps
(see Line 8 of SVRG++), where $m_s$ doubles between every two consecutive epochs. This “doubling”
feature distinguishes our method from all of the cited variance-reduction based methods.

Within each epoch, similar to SVRG, we compute the full gradient $\tilde{\mu}_{s-1} = \nabla f(\tilde{x}^{s-1})$ where
$\tilde{x}^{s-1}$ is the average point of the previous epoch. We then use $\tilde{\mu}_{s-1}$ to define the variance-reduced
stochastic gradient $\xi$; see Line 7 of SVRG++. Unlike SVRG, our starting vector $x_0^s$ of each epoch is
set to be the ending vector $x_{m_{s-1}}^{s-1}$ of the previous epoch, rather than the average of the previous
epoch.$^7$

We state our main result for SVRG ++ as follows: 


Theorem 4.1. If each $f_i(x)$ is convex in Problem (1.1), $m_0$ and $S$ are positive integers, and
$\eta = 1/(7L)$, then SVRG++$(x^\phi, m_0, S, \eta)$ satisfies

$$\mathbb{E}[F(\tilde{x}^S) - F(x^*)] \le O\Big( \frac{F(x^\phi) - F(x^*)}{2^S} + \frac{L\|x^\phi - x^*\|^2}{2^S m_0} \Big) . \qquad (4.1)$$

In addition, SVRG++ has a gradient complexity of $O(S \cdot n + 2^S \cdot m_0)$.

As a result, given an initial vector $x^\phi$ satisfying $\|x^\phi - x^*\|^2 \le \Theta$ and $F(x^\phi) - F(x^*) \le \Delta$ for
parameters $\Theta, \Delta \in \mathbb{R}_+$, by setting $S = \log_2(\Delta/\varepsilon)$, $m_0 = L\Theta/\Delta$, and $\eta = 1/(7L)$, we obtain an
$O(\varepsilon)$-approximate minimizer of $F(\cdot)$ with a total gradient complexity $O\big(n\log\frac{\Delta}{\varepsilon} + \frac{L\Theta}{\varepsilon}\big)$.
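For completeness, the arithmetic behind this parameter choice can be spelled out. The display below is a sketch that substitutes $S = \log_2(\Delta/\varepsilon)$ and $m_0 = L\Theta/\Delta$ into the two error terms and into the gradient complexity $O(S \cdot n + 2^S \cdot m_0)$, ignoring the rounding of $S$ and $m_0$ to integers.

```latex
\begin{align*}
\frac{F(x^\phi) - F(x^*)}{2^S}
  &\le \frac{\Delta}{\Delta/\varepsilon} = \varepsilon , \\
\frac{L\|x^\phi - x^*\|^2}{2^S m_0}
  &\le \frac{L\Theta}{(\Delta/\varepsilon)\cdot(L\Theta/\Delta)} = \varepsilon , \\
S \cdot n + 2^S \cdot m_0
  &= n\log_2\frac{\Delta}{\varepsilon} + \frac{\Delta}{\varepsilon}\cdot\frac{L\Theta}{\Delta}
   = n\log_2\frac{\Delta}{\varepsilon} + \frac{L\Theta}{\varepsilon} .
\end{align*}
```

Both error terms are at most $\varepsilon$, so the expected error is $O(\varepsilon)$, and the number of epochs enters the complexity only through the $n\log_2(\Delta/\varepsilon)$ term.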


7 The theoretical convergence of SVRG relies on its Option II, that is to set the beginning vector of each epoch to 
be the average (or a random) vector of the previous epoch. However, the authors of SVRG conduct their experiment 
using the last vector rather than the average because it is more “natural”. This present paper partially shows that 
this natural choice also has competitive performance, and therefore confirms the empirical finding of SVRG. (Similar 
result can also be obtained for the strongly convex case, which we exclude for simplicity.) 
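Algorithm 1 SVRG++$(x^\phi, m_0, S, \eta)$

1: $\tilde{x}^0 \leftarrow x^\phi$, $x_0^1 \leftarrow x^\phi$
2: for $s \leftarrow 1$ to $S$ do
3:   $\tilde{\mu}_{s-1} \leftarrow \nabla f(\tilde{x}^{s-1})$
4:   $m_s \leftarrow 2^s \cdot m_0$
5:   for $t \leftarrow 0$ to $m_s - 1$ do
6:     Pick $i$ uniformly at random in $\{1, \dots, n\}$.
7:     $\xi \leftarrow \nabla f_i(x_t^s) - \nabla f_i(\tilde{x}^{s-1}) + \tilde{\mu}_{s-1}$
8:     $x_{t+1}^s = \arg\min_{y \in \mathbb{R}^d} \big\{ \frac{1}{2\eta}\|x_t^s - y\|^2 + \Psi(y) + \langle\xi, y\rangle \big\}$
9:   end for
10:  $\tilde{x}^s \leftarrow \frac{1}{m_s}\sum_{t=1}^{m_s} x_t^s$
11:  $x_0^{s+1} \leftarrow x_{m_s}^s$
12: end for
13: return $\tilde{x}^S$.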

High-Level Techniques. Our proof is based on a new way to telescope regret inequalities that
is specially designed for growing-epoch methods. Unlike the analysis of SVRG, we telescope not
only across iterations, see (A.2), but also across epochs, see (A.3). In contrast, the original SVRG
has to rely on the strong convexity of $f(\cdot)$ in order to combine different epochs, which is why
SVRG cannot directly solve non-strongly convex objectives. Our technique is also very different
from known direct methods such as SAG or SAGA: to some extent, these methods can be viewed as
having “equivalent” epoch length $n$, because each stochastic gradient in SAG or SAGA is updated
once every $n$ iterations on average. As a result, it may be hard to grow their epoch length. Finally,
it is the telescoping across all epochs and all iterations that requires the starting vector of an epoch
to be the last one from the previous epoch (which is different from SVRG). We shall demonstrate
in our experiment section that these modifications on top of SVRG are also useful in practice.

Our full proof of Theorem 4.1 is included in Appendix A.

4.1 Additional Improvements 

Inspired by SVRG++, we also introduce SVRG_Auto_Epoch, a variant of SVRG++ where the epoch length
is automatically determined instead of doubled every epoch. Auto epoch is an attractive feature in
practice because it enables the algorithm to perform well for different types of objectives.

The criterion we use to determine the termination of epoch $s$ in SVRG_Auto_Epoch is based on the
quality of the snapshot full gradient $\nabla f(\tilde{x}^{s-1})$. Intuitively, if the epoch length is too long, an algorithm
may move too far from the snapshot point, meaning that the gradient estimator $\xi$ may have a large
variance. Following this intuition, for every iteration $t$, we record $\mathrm{diff}_t = \|\nabla f_i(x_t^s) - \nabla f_i(\tilde{x}^{s-1})\|^2$
because $\mathbb{E}_i[\mathrm{diff}_t]$ is a very tight upper bound on the variance of the gradient estimator (see the
proof of Lemma A.2). Under this notion, we decide the epoch termination of SVRG_Auto_Epoch as
follows. Each epoch has a minimum length of $n/4$. From iteration $t = n/4$ onwards, we keep track
of the average $\mathrm{diff}_j$ in the last $n/4$ iterations, i.e., over $j = t - n/4 + 1, \dots, t$. If this quantity is greater
than half of the average $\mathrm{diff}_j$ recorded from the previous epoch, we terminate the current epoch
and start a new one.$^8$ SVRG_Auto_Epoch shows good performance in our experiments, and we leave
it as an open question to prove a complexity result for this method.
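The termination rule can be sketched as a small bookkeeping class. This is only an illustration of the criterion as described above: the class and method names are hypothetical, the window is taken to be $\lfloor n/4 \rfloor$ iterations, and the fixed lengths of the first two epochs are left to the caller.

```python
from collections import deque

class AutoEpochMonitor:
    """Tracks diff_t within one epoch and signals when the epoch should end."""

    def __init__(self, n, prev_epoch_avg):
        self.window = max(1, n // 4)              # minimum epoch length and window size
        self.prev_epoch_avg = prev_epoch_avg      # average diff over the previous epoch
        self.recent = deque(maxlen=self.window)   # the last n/4 recorded diffs
        self.total = 0.0
        self.count = 0

    def update(self, diff):
        """Record diff_t; return True if the current epoch should terminate."""
        self.recent.append(diff)
        self.total += diff
        self.count += 1
        if self.count < self.window:              # enforce the minimum length
            return False
        recent_avg = sum(self.recent) / len(self.recent)
        return recent_avg > 0.5 * self.prev_epoch_avg

    def epoch_average(self):
        """Average diff over the whole epoch, to be passed on to the next epoch."""
        return self.total / self.count
```

A caller would create one monitor per epoch, feed it $\mathrm{diff}_t$ after every stochastic step, and start a new epoch (with `epoch_average()` as the next `prev_epoch_avg`) as soon as `update` returns `True`.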

In addition to auto epoch, SVRG++ can also be combined with other enhancements proposed
for SVRG. For example, [10] saves the time to compute full gradients at snapshot points by making
8 We always set the first epoch to be of length n/4 and the second to be of length n/2. 
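Algorithm 2 SVRG$(x^\phi, m, S, \eta)$ [13]

1: $\tilde{x}^0 \leftarrow x^\phi$, $x_0^1 \leftarrow x^\phi$
2: for $s \leftarrow 1$ to $S$ do
3:   $\tilde{\mu}_{s-1} \leftarrow \nabla f(\tilde{x}^{s-1})$
4:   for $t \leftarrow 0$ to $m - 1$ do
5:     Pick $i$ uniformly at random in $\{1, \dots, n\}$.
6:     $\xi \leftarrow \nabla f_i(x_t^s) - \nabla f_i(\tilde{x}^{s-1}) + \tilde{\mu}_{s-1}$
7:     $x_{t+1}^s = \arg\min_{y \in \mathbb{R}^d} \big\{ \frac{1}{2\eta}\|x_t^s - y\|^2 + \Psi(y) + \langle\xi, y\rangle \big\}$
8:   end for
9:   $\tilde{x}^s \leftarrow \frac{1}{m}\sum_{t=1}^{m} x_t^s$
10:  $x_0^{s+1} \leftarrow \tilde{x}^s$
11: end for
12: return $\tilde{x}^S$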


them less accurate in the first few epochs, and [14] uses mini-batch gradients per iteration to further
decrease the variance. These ideas are orthogonal to our proposed techniques and can therefore be
applied to further improve the performance of SVRG++.

5 SVRG for Sum-of-Non-Convex Objectives I: Small Lower Smoothness

In this section we consider Problem (1.1) when each $f_i(x)$ is not necessarily convex, but is $L$-upper
smooth and $l$-lower smooth for some $0 \le l \le L$. We assume that $f(\cdot)$ is $\sigma$-strongly convex. For this class of
objectives, the best known gradient complexity for stochastic gradient methods is $O\big((n + \frac{L^2}{\sigma^2})\log\frac{1}{\varepsilon}\big)$
due to SVRG [22].

This gradient complexity is essentially a factor $L/\sigma$ greater than that for the convex case,
that is, $O\big((n + \frac{L}{\sigma})\log\frac{1}{\varepsilon}\big)$. Following the intuition discussed in the introduction, we improve it to
$O\big((n + \frac{L}{\sigma} + \frac{Ll}{\sigma^2})\log\frac{1}{\varepsilon}\big)$, a quantity that is asymptotically the same as the convex setting when
$l \le O(\sigma)$, and linearly degrades as $l$ increases.

Recall that the original SVRG (Option II) works as follows (see Algorithm 2 for completeness).
Given an initial vector $x^\phi$, SVRG is divided into $S$ epochs, each of length $m$ for the same $m$ across
epochs. Within each epoch, SVRG computes the full gradient $\tilde{\mu}_{s-1} = \nabla f(\tilde{x}^{s-1})$ where $\tilde{x}^{s-1}$ is the
average point of the previous epoch. Then, SVRG uses $\tilde{\mu}_{s-1}$ to define the variance-reduced version
of the stochastic gradient $\xi$; see Line 6 of Algorithm 2. The starting vector $x_0^s$ of each epoch is set
to be the average vector of the previous epoch.$^9$

We state our main result for SVRG in this section as follows: 

9 This choice of the starting vector is different from SVRG ++ , but was the original choice made by SVRG. Similar 
result can also be obtained using the choice from SVRG ++ . 








Theorem 5.1. If each $f_i(x)$ is $L$-upper and $l$-lower smooth in Problem (1.1) for $0 \le l \le L$, $f(x)$
is a-strongly convex, r] = q§]j} o,nd m > ^ = fl(max{^, bj}), then SVRG(F, m, S, rj) 

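satisfies$^a$

$$\mathbb{E}[F(\tilde{x}^s) - F(x^*)] \le \frac{3}{4}\big(F(\tilde{x}^{s-1}) - F(x^*)\big) . \qquad (5.1)$$

Therefore, by setting $S = \log_{4/3}\big(\frac{F(x^\phi) - F(x^*)}{\varepsilon}\big)$, in a total gradient complexity of

$$O\Big(\big(n + \frac{L}{\sigma}\max\{1, \tfrac{l}{\sigma}\}\big)\log\frac{F(x^\phi) - F(x^*)}{\varepsilon}\Big) ,$$

we obtain an output $\tilde{x}^S$ satisfying $\mathbb{E}[F(\tilde{x}^S) - F(x^*)] \le \varepsilon$.

$^a$ Here we have assumed that the first $s - 1$ epochs are fixed and the only randomness comes from epoch $s$.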

Our technique for proving this theorem depends on the following new upper bound on the
variance. Denoting by $\xi_t^s$ the stochastic gradient $\xi$ at epoch $s$ and iteration $t$, and denoting by $i_t^s$
the random index $i$ chosen at epoch $s$ and iteration $t$, we have

Lemma 5.2.

$$\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] \le 4(L + l)\cdot\big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big) + (8l^2 + 4Ll)\big(\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2\big) .$$

This is different from Section 4.1 of [22], where the author only provided a weaker upper bound
$O(L^2)\cdot(\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2)$. In the event that $l$ is very small, our new upper bound reduces
to the variance upper bound in the convex setting; see for instance Eq. (8) of [13].

The full proof of Theorem 5.1 is included in Appendix B. 


6 SVRG for Sum-of-Non-Convex Objectives II: Large Lower Smoothness

In this section we consider Problem (1.1) when each $f_i(x)$ is a not necessarily convex, $L$-upper smooth,
and $l$-lower smooth function for some $l \ge L$. We assume $f(\cdot)$ is $\sigma$-strongly convex. For this class of
objectives, the best known gradient complexity for stochastic gradient methods is $O\big((n + \frac{l^2}{\sigma^2})\log\frac{1}{\varepsilon}\big)$
due to [22].

This known gradient complexity is essentially a factor $l^2/L^2 \ge 1$ worse than that of the symmetric
case (i.e., the case when $l = L$). In this section, we improve this factor to $l/L$, a quadratic
improvement over $l^2/L^2$. As we have explained in the introduction, this result improves the
convergence of the best known stochastic algorithm for PCA.

We state our main result for SVRG in this section as follows. 
Theorem 6.1. If each $f_i(x)$ is $L$-upper and $l$-lower smooth in Problem (1.1) for $l \ge L \ge 0$, $f(x)$
is a-strongly convex, r) = an< ^ rn — an = ^(yj); then SVRG (x$,m, S,rj) satisfies 

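$$\mathbb{E}[F(\tilde{x}^s) - F(x^*)] \le \frac{3}{4}\big(F(\tilde{x}^{s-1}) - F(x^*)\big) . \qquad (6.1)$$

Therefore, by setting $S = \log_{4/3}\big(\frac{F(x^\phi) - F(x^*)}{\varepsilon}\big)$, in a total gradient complexity of

$$O\Big(\big(n + \frac{Ll}{\sigma^2}\big)\log\frac{F(x^\phi) - F(x^*)}{\varepsilon}\Big) ,$$

we obtain an output $\tilde{x}^S$ satisfying $\mathbb{E}[F(\tilde{x}^S) - F(x^*)] \le \varepsilon$.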


Although Theorem 6.1 (for the large $l$ setting) has the same form as Theorem 5.1 (for the small
$l$ setting), its proof is quite different. In order to provide a variance bound without paying the
$l^2$ factor as in Lemma 5.2, we negate the objective for analysis purposes only. This is reasonable
because $-f_i(\cdot)$ becomes $l$-upper smooth but only $L$-lower smooth for $L \le l$. By applying the
smoothness lemmas for minimizing $-f_i(\cdot)$ (and thus maximizing $f_i(x)$), we obtain a better variance
upper bound without paying the factor $l^2$.

Details of the proof are included in Appendix C.


7 SVRG$^{++}_{\mathrm{nc}}$ for Non-Strongly Convex AND Sum-of-Non-Convex Objectives

In this section we show that our improvements for (1) non-strongly convex objectives in Section 4
and for (2) sum-of-non-convex objectives in Sections 5 and 6 can be non-trivially put together. That
is, we consider the case of Problem (1.1) when each $f_i(x)$ is a not necessarily convex function but is
$L$-upper and $l$-lower smooth for $l \ge 0$. We assume that $f$, the average of the functions $f_i$, is simply
convex but not necessarily strongly convex.

For this class of objectives, if one applies a classical regularization reduction (adding a dummy
$\frac{\sigma}{2}\|x\|^2$ regularizer for a suitable $\sigma$) to the result of Shalev-Shwartz [22], one can obtain a gradient
complexity of essentially $O\big((n + \frac{L^2}{\varepsilon^2})\log\frac{1}{\varepsilon}\big)$. If one applies the same reduction to our new analysis
in Sections 5 and 6, one can obtain a gradient complexity of essentially $O\big((n + \frac{L}{\varepsilon} + \frac{Ll}{\varepsilon^2})\log\frac{1}{\varepsilon}\big)$. Note
that the so-obtained algorithms are indirect and biased.

We propose a direct algorithm SVRG$^{++}_{\mathrm{nc}}$ for solving this class of objectives with a gradient
complexity of $O\big(n\log\frac{1}{\varepsilon} + \frac{L}{\varepsilon} + \frac{Ll}{\varepsilon^2}\big)$.

Our SVRG$^{++}_{\mathrm{nc}}$ algorithm for this case is analogous to SVRG++ in Section 4. Given an initial vector
$x^\phi$, our algorithm is divided into $S$ epochs. The $s$-th epoch consists of $m_s$ stochastic gradient steps,
where $m_s$ doubles between every two consecutive epochs. As before, within each epoch we compute
the full gradient $\tilde{\mu}_{s-1} = \nabla f(\tilde{x}^{s-1})$ where $\tilde{x}^{s-1}$ is the average point of the previous epoch. We also
use $\tilde{\mu}_{s-1}$ to define the variance-reduced version of the stochastic gradient $\xi$. Unlike SVRG++, for
analysis purposes the step length $\eta$ is no longer a constant throughout the iterations. However, it
will almost remain a constant.

More precisely, define $T = m_1 + \cdots + m_S \le 2m_0 \cdot 2^S$ to be the total number of iterations. Then,








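Algorithm 3 SVRG$^{++}_{\mathrm{nc}}(x^\phi, m_0, S, \eta)$

1: $\tilde{x}^0 \leftarrow x^\phi$, $x_0^1 \leftarrow x^\phi$
2: for $s \leftarrow 1$ to $S$ do
3:   $\tilde{\mu}_{s-1} \leftarrow \nabla f(\tilde{x}^{s-1})$
4:   $m_s \leftarrow 2^s \cdot m_0$
5:   $k \leftarrow 0$ and $T \leftarrow m_1 + \cdots + m_S$
6:   for $t \leftarrow 0$ to $m_s - 1$ do
7:     Pick $i$ uniformly at random in $\{1, \dots, n\}$.
8:     $\xi \leftarrow \nabla f_i(x_t^s) - \nabla f_i(\tilde{x}^{s-1}) + \tilde{\mu}_{s-1}$
9:     $k \leftarrow k + 1$ and $\eta_{t+1}^s \leftarrow \eta\sqrt{T}/\sqrt{2T - k}$
10:    $x_{t+1}^s = \arg\min_{y \in \mathbb{R}^d} \big\{ \frac{1}{2\eta_t^s}\|x_t^s - y\|^2 + \Psi(y) + \langle\xi, y\rangle \big\}$
11:  end for
12:  $\tilde{x}^s \leftarrow \frac{1}{m_s}\sum_{t=0}^{m_s - 1} x_t^s$
13:  $x_0^{s+1} \leftarrow x_{m_s}^s$
14: end for
15: return $\tilde{x}^S$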

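for some parameter $\eta > 0$ to be chosen later, we define the sequence of step lengths

$$\big(\eta_0^1, \eta_1^1, \ldots, \eta_{m_1}^1 (= \eta_0^2), \eta_1^2, \ldots, \eta_{m_2}^2 (= \eta_0^3), \eta_1^3, \ldots, \eta_{m_S}^S\big) \stackrel{\mathrm{def}}{=} \Big(\frac{\eta\sqrt{T}}{\sqrt{2T}}, \frac{\eta\sqrt{T}}{\sqrt{2T-1}}, \ldots, \frac{\eta\sqrt{T}}{\sqrt{T}}\Big).$$

Note that in the above definition, the last step length $\eta_{m_s}^s$ of each epoch is chosen to be the same as the first step length $\eta_0^{s+1}$ of the next epoch. We also have $\eta/\sqrt{2} \le \eta_t^s \le \eta$ for all epochs $s$ and all iterations $t \in \{0, 1, \ldots, m_s\}$. Since for every real $k \ge 1$ we have $\sqrt{k} - \sqrt{k-1} \ge \frac{1}{2\sqrt{k}}$, it satisfies that

$$\frac{1}{\eta_t^s} - \frac{1}{\eta_{t+1}^s} \ge \frac{1}{2\eta\sqrt{T}\sqrt{2T}} = \frac{1}{2\sqrt{2}\,\eta T}. \tag{7.1}$$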
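To make the epoch-doubling loop and the slowly growing step lengths concrete, here is a minimal Python sketch on a toy least-squares problem. It is an illustration under stated assumptions, not the authors' implementation: it takes $\Psi = 0$ (so the proximal step is a plain gradient step), uses $f_i(x) = \frac{1}{2}(\langle a_i, x\rangle - b_i)^2$, and all function names (`step_lengths`, `svrgpp_sketch`, and so on) are hypothetical.

```python
import math
import random


def step_lengths(eta, T):
    """Step lengths eta*sqrt(T)/sqrt(2T - k) for k = 0..T (global counter k)."""
    return [eta * math.sqrt(T) / math.sqrt(2 * T - k) for k in range(T + 1)]


def objective(A, b, x):
    """f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2 (toy objective, assumption)."""
    return sum(0.5 * (sum(aj * xj for aj, xj in zip(a, x)) - bi) ** 2
               for a, bi in zip(A, b)) / len(A)


def full_gradient(A, b, x):
    """Full gradient of the toy objective at x."""
    n, d = len(A), len(x)
    g = [0.0] * d
    for a, bi in zip(A, b):
        r = sum(aj * xj for aj, xj in zip(a, x)) - bi
        for j in range(d):
            g[j] += r * a[j] / n
    return g


def svrgpp_sketch(A, b, x_phi, m0, S, eta, rng):
    """Epoch-doubling variance-reduced loop with Psi = 0 (assumption)."""
    n, d = len(A), len(x_phi)
    T = sum(2 ** s * m0 for s in range(1, S + 1))
    etas = step_lengths(eta, T)
    x_tilde, x, k = list(x_phi), list(x_phi), 0
    for s in range(1, S + 1):
        mu = full_gradient(A, b, x_tilde)      # snapshot full gradient
        m_s = 2 ** s * m0
        running_sum = [0.0] * d
        for _ in range(m_s):
            # accumulate x_0, ..., x_{m_s - 1} for the epoch average
            running_sum = [u + v for u, v in zip(running_sum, x)]
            i = rng.randrange(n)
            r_x = sum(aj * xj for aj, xj in zip(A[i], x)) - b[i]
            r_snap = sum(aj * xj for aj, xj in zip(A[i], x_tilde)) - b[i]
            xi = [(r_x - r_snap) * A[i][j] + mu[j] for j in range(d)]
            k += 1
            x = [xj - etas[k] * gj for xj, gj in zip(x, xi)]
        x_tilde = [v / m_s for v in running_sum]
        # x is kept: the next epoch starts from the last iterate
    return x_tilde


rng = random.Random(0)
n, d = 20, 3
A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
x_true = [1.0, -2.0, 0.5]
b = [sum(aj * xj for aj, xj in zip(a, x_true)) for a in A]
L = max(sum(aj * aj for aj in a) for a in A)   # smoothness of each f_i
eta = 1.0 / (13.0 * L)
x_out = svrgpp_sketch(A, b, [0.0] * d, m0=n // 4, S=6, eta=eta, rng=rng)
```

The counter `k` is kept across epochs, so the step length used at the end of one epoch is the one the next epoch starts from, matching the convention described above.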



8 Experiments on Empirical Risk Minimization 

We confirm our theoretical findings using four real-life datasets: (1) the Adult dataset (32,561 examples and 123 features), (2) the Covtype dataset (581,012 examples and 54 features), (3) the IJCNN1 dataset (49,990 examples and 22 features), and (4) the 2nd class of the MNIST dataset (60,000 examples and 780 features) [6]. In order to make easy comparisons between different datasets, we scale each data vector down by the average Euclidean norm of the whole data set. This step is for comparison only and not necessary in practice.

[Figure 1 plots: panels include (b) Adult, Logistic and (c) Covtype, Lasso; curves for SDCA, SAGA, SVRG, SVRG-Auto-Epoch, and SVRG++, each labeled with its best-tuned parameter.]

Figure 1: Selected performance comparisons for Lasso and logistic regression using Tuning Type I. Our comprehensive comparisons for other regularizer weights as well as ridge regression can be found in Figures 3, 4, 5, and 6 in the appendix.
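The scaling step can be sketched in a few lines of Python. This is an illustrative helper under the assumption that every data vector is divided by one shared constant, the average Euclidean norm over the data set; the function name `scale_by_average_norm` is hypothetical.

```python
import math


def scale_by_average_norm(rows):
    """Divide every data vector by the average Euclidean norm of the data set."""
    avg_norm = sum(math.sqrt(sum(v * v for v in r)) for r in rows) / len(rows)
    return [[v / avg_norm for v in r] for r in rows]


# Toy data with norms 5, 1, and 10, so the average norm is 16/3.
data = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
scaled = scale_by_average_norm(data)
```

After this step the average norm of the scaled vectors is 1, which puts step lengths tuned on different datasets on a comparable scale.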

We perform 3 classification tasks: Lasso, ridge regression, and $\ell_1$-regularized logistic regression. As described in the introduction, Lasso and logistic regression do not admit strongly convex objectives, while the ridge objective is strongly convex. We consider four different values $\sigma \in \{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$, where $\sigma$ is either the weight in regularizer $\frac{\sigma}{2}\|x\|_2^2$ for ridge, or that in regularizer $\sigma\|x\|_1$ for Lasso and logistic regression.

We have implemented the following algorithms: 

• SVRG++ with initial epoch length $m_0 = n/4$.

• SVRG_Auto_Epoch as we described in Section 4.1.

• SVRG [13, 29] with (their suggested) epoch length $m = 2n$. (Recall that, in theory, SVRG is not designed for non-strongly convex objectives, and a dummy regularizer needs to be added to $F(\cdot)$ for Lasso and logistic regression. However, in our experiments, we observed that this dummy regularizer is not necessary, so we have omitted the regularized version of SVRG for a clean comparison.)

• SAGA [5].

• SDCA [24, 25] with Option I (steepest descent). Since SDCA works only with strongly convex objectives, a dummy regularizer has to be introduced for Lasso and logistic regression.

[Figure 2 plots: panel (a) parameterized by $\delta$; panel (b) parameterized by $\kappa$.]

Figure 2: Performance analysis on sum-of-non-convex objectives. Note that the curves for $\delta = 0.001, 0.01, 0.02$ overlap in (a).

For each algorithm above except SDCA, we tune the step length carefully from the set $\{a \times 10^{-k} : a \in \{1, 2, \ldots, 9\}, k \in \mathbb{Z}\}$ for each plot. For SDCA on Lasso and logistic regression, we also tune the weight of its dummy regularizer from the set $\{10^{-k}, 2 \times 10^{-k}, 5 \times 10^{-k} : k \in \mathbb{Z}\}$. To make our comparison stronger, we adopt an anonymous reviewer's suggestion and consider two types of parameter tuning. In Tuning Type I, we select the best curve based on the training objective performance over the entire 30 passes of the dataset. In Tuning Type II, we select the best parameter based only on the method's performance in the first 4 passes of the dataset. Tuning Type II might be more realistic for experimentalists who need to quickly pick the best parameters of the algorithms.
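The two tuning protocols can be illustrated with a short Python sketch. Two assumptions are made here that the text does not spell out: the grid is truncated to a finite range of exponents $k$, and "best" is read as the lowest objective value at the last pass considered. The helper names and the two toy curves are hypothetical and are not experimental data.

```python
def step_length_grid(k_min, k_max):
    """Candidates a * 10^(-k) for a in 1..9 and k in [k_min, k_max]."""
    return sorted({a * 10.0 ** (-k)
                   for a in range(1, 10)
                   for k in range(k_min, k_max + 1)})


def pick_best(curves, passes):
    """Pick the step length whose objective gap is lowest after `passes` passes.

    `curves` maps a step length to a list of objective gaps indexed by pass.
    """
    return min(curves, key=lambda eta: curves[eta][passes])


grid = step_length_grid(0, 2)

# Made-up curves: one drops fast and then plateaus, the other is slow but steady.
curves = {
    0.9: [max(0.5 ** p, 1e-3) for p in range(31)],
    0.3: [0.7 ** p for p in range(31)],
}
type_one_choice = pick_best(curves, 30)   # judged over all 30 passes
type_two_choice = pick_best(curves, 4)    # judged on the first 4 passes only
```

With these toy curves the two protocols disagree, which is the situation in which the distinction between Tuning Type I and Tuning Type II matters.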

In each plot, we run the experiments 10 times and plot both the mean and the variance. Since our plots are in log scale, we only keep the upper error bar to make the plots easier to read. In other words, the lower end of each error bar represents the mean of each data point.

Performance Comparison. We have picked a representative regularizer weight $\sigma$ for each of the eight analysis tasks (Lasso or logistic regression on one of the four datasets), and presented the performance plots using Tuning Type I in Figure 1. For the results on other values of $\sigma$ as well as those for ridge regression, see Figures 3, 4, 5, and 6 in the appendix. We have also included plots using Tuning Type II in Figures 7, 8, 9, and 10 in the appendix.

In all of our plots, the y-axis represents the training objective value minus the minimum, and the x-axis represents the number of passes over the dataset. Here, following the tradition, one iteration of each algorithm counts as $1/n$ pass of the dataset, and the snapshot full-gradient computation of SVRG, SVRG++, and SVRG_Auto_Epoch counts as one additional pass.
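As a small worked example of this pass-counting convention, the following Python sketch tallies passes for SVRG++ (epoch lengths $m_s = 2^s m_0$) and for SVRG with a fixed epoch length. The function names are hypothetical, and the numbers in the usage lines are illustrative only.

```python
def passes_svrgpp(n, m0, S):
    """Passes used by S epochs of SVRG++ with epoch lengths m_s = 2^s * m0."""
    stochastic_steps = sum(2 ** s * m0 for s in range(1, S + 1))
    # each stochastic step costs 1/n pass; each epoch adds one full-gradient pass
    return S + stochastic_steps / n


def passes_svrg(n, m, epochs):
    """Passes used by SVRG with a fixed epoch length m."""
    return epochs * (1 + m / n)


# Illustrative sizes: n = 1000 examples, m0 = n/4 for SVRG++, m = 2n for SVRG.
svrgpp_cost = passes_svrgpp(n=1000, m0=250, S=3)
svrg_cost = passes_svrg(n=1000, m=2000, epochs=10)
```

Under this accounting, three epochs of SVRG++ with $m_0 = n/4$ cost 6.5 passes, while each SVRG epoch with $m = 2n$ costs 3 passes.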

In the legend of each plot, we use SDCA($r = r_0$) to denote that $r_0$ is the weight of the best-tuned dummy regularizer. For every other algorithm, we use Alg($\eta$) to denote that $\eta$ is the best-tuned step length for algorithm Alg.















We make the following observations from this experiment: 

• SVRG++ and SVRG_Auto_Epoch consistently outperform SVRG in all the plots, indicating that they do improve over SVRG in non-strongly convex settings.

• SVRG++ and SVRG_Auto_Epoch outperform SAGA in most cases, and are at least comparable to SAGA in the remaining cases. This is not surprising because SAGA is also a direct algorithm for non-strongly convex objectives.

• SVRG++ and SVRG_Auto_Epoch significantly outperform indirect methods via dummy regularization (i.e., SDCA) in the non-strongly convex settings. For ridge regression, which is strongly convex, SDCA is comparable to other methods (see the figures in the appendix).


9 Experiments for Sum-of-Non-Convex Objectives 

To verify our theoretical findings in Sections 5 and 6, we run SVRG on a sum-of-non-convex objective built from synthetically generated data. We generate $n = 500$ random vectors $a_1, \ldots, a_{500} \in \mathbb{R}^d$ from the $d = 200$ dimensional unit cube and then normalize them to have Euclidean norm 1. Define the covariance matrix $A = \frac{1}{n}\sum_{i=1}^n a_i a_i^T$, and we consider the minimization problem$^{10}$

$$\min_{x \in \mathbb{R}^d} \Big\{ f(x) = \frac{1}{2} x^T A x + b^T x \Big\}$$

for some randomly generated vector $b$.
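The data generation can be sketched as follows in plain Python. This is an illustration under assumptions: "unit cube" is read as $[0,1]^d$, smaller sizes than $n = 500$, $d = 200$ are used to keep the example fast, and the helper names are hypothetical.

```python
import math
import random


def unit_vectors_from_cube(n, d, rng):
    """Sample n vectors from [0, 1]^d and normalize each to Euclidean norm 1."""
    vecs = []
    for _ in range(n):
        v = [rng.random() for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        vecs.append([c / norm for c in v])
    return vecs


def covariance(vecs):
    """A = (1/n) * sum_i a_i a_i^T as a dense list-of-lists matrix."""
    n, d = len(vecs), len(vecs[0])
    A = [[0.0] * d for _ in range(d)]
    for a in vecs:
        for p in range(d):
            for q in range(d):
                A[p][q] += a[p] * a[q] / n
    return A


def f(A, b, x):
    """f(x) = 0.5 * x^T A x + b^T x."""
    Ax = [sum(A[p][q] * x[q] for q in range(len(x))) for p in range(len(x))]
    return 0.5 * sum(xp * yp for xp, yp in zip(x, Ax)) + sum(bp * xp for bp, xp in zip(b, x))


rng = random.Random(1)
vecs = unit_vectors_from_cube(n=50, d=5, rng=rng)
A = covariance(vecs)
trace_A = sum(A[p][p] for p in range(5))
```

Because every $a_i$ has norm 1, the trace of $A$ equals 1, so its largest eigenvalue lies in $[1/d, 1]$.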

The matrix $A$ we generated has minimum eigenvalue equal to $7.02 \times 10^{-4}$, and thus $f(x)$ is strongly convex with parameter $7.02 \times 10^{-4}$. Next, we decompose $f(x)$ into an average of $f_i(x)$, each being non-convex with upper and lower smoothness parameters that we can control.

More specifically, given $n$ diagonal matrices $D_1, \ldots, D_n$ satisfying $D_1 + \cdots + D_n = 0$, by setting

$$f_i(x) \stackrel{\mathrm{def}}{=} \frac{1}{2} x^T (a_i a_i^T + D_i) x + b^T x,$$

we have $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$. Under this construction, each $f_i$ is non-convex if $D_i$ has negative entries in the diagonals. We now consider two different ways to build $D_1, \ldots, D_n$.


Remark 9.1. We do not perform real-life PCA experiments for the following reason. Recall Garber and Hazan reduced PCA to minimizing $f(x) = \frac{1}{2}x^T(\mu I - A)x + b^T x$. For all interesting choices of $\mu$, our result in Theorem 6.1 is faster than theirs by the same constant factor $\lambda \in [1/d, 1]$, which is the largest eigenvalue of $A$. Therefore, by varying $\mu$ and comparing the plots, it is impossible to observe anything interesting: in particular, one cannot conclude our theoretical bound is tighter in practice. In contrast, our carefully designed synthetic experiment allows us to control the upper and lower smoothness parameters, and therefore to observe the improvements of our theorems directly.

Our first experiment is parameterized by a given value $\delta \in [0, 1]$. For each $j \in [d]$, we randomly select half of the indices $i \in [n]$ and assign its $j$-th diagonal $(D_i)_{jj}$ to be $\delta$; for the other half of the indices $i$ we assign $(D_i)_{jj}$ to be $-\delta$. In this way, we satisfy $D_1 + \cdots + D_n = 0$ and for each $i \in [n]$, we have $-\delta I \preceq \nabla^2 f_i(x) \preceq (1 + \delta) I$. In other words, each function $f_i(x)$ is $L \approx 1$ upper smooth and exactly $l = \delta$ lower smooth. This corresponds to the $l \le L$ regime studied by Section 5.

Our second experiment is parameterized by a given value $\kappa \ge 1$. For each $j \in [d]$, consider the $j$-th diagonal entry of all the matrices, $(D_1)_{jj}, (D_2)_{jj}, \ldots, (D_n)_{jj}$. We randomly select one of these entries and set it to be $-\kappa$, and the rest $n - 1$ of them to be $\frac{\kappa}{n-1}$. Under this definition, we have $D_1 + \cdots + D_n = 0$ and for each $i \in [n]$, we have $-\kappa I \preceq \nabla^2 f_i(x) \preceq (1 + \kappa/(n-1)) I$. In other words, each function $f_i(x)$ is approximately $L \approx 1$ upper smooth and $l = \kappa$ lower smooth. This corresponds to the $l \ge L$ regime studied by Section 6.

$^{10}$ Since $x^* = A^{-1} b$ this is a linear system problem.
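Both ways of building the diagonal matrices can be sketched in Python. This is an illustration only: each $D_i$ is stored as the list of its diagonal entries, the first construction assumes $n$ is even, and the function names are hypothetical.

```python
import random


def balanced_diagonals(n, d, delta, rng):
    """First construction: per coordinate j, half of the D_i get +delta, half -delta."""
    D = [[0.0] * d for _ in range(n)]
    for j in range(d):
        plus = set(rng.sample(range(n), n // 2))
        for i in range(n):
            D[i][j] = delta if i in plus else -delta
    return D


def spiked_diagonals(n, d, kappa, rng):
    """Second construction: per coordinate j, one D_i gets -kappa, the rest kappa/(n-1)."""
    D = [[kappa / (n - 1)] * d for _ in range(n)]
    for j in range(d):
        D[rng.randrange(n)][j] = -kappa
    return D


def column_sums(D):
    """Diagonal of D_1 + ... + D_n."""
    return [sum(row[j] for row in D) for j in range(len(D[0]))]


rng = random.Random(2)
D_first = balanced_diagonals(n=6, d=4, delta=0.5, rng=rng)
D_second = spiked_diagonals(n=6, d=4, kappa=5.0, rng=rng)
```

In both cases the diagonals sum to zero coordinate by coordinate, so averaging the $f_i$ recovers the original $f$.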

We run SVRG (with the best tuned step length) for both experiments, and plot the performance in Figure 2. We make the following observations from the plots:

• In Figure 2(a), we observe that the performance of SVRG is approximately linearly proportional to $lL = O(\delta)$ for large $\delta$, as compared to $L^2 = O(1)$ from prior work. More importantly, SVRG is robust against small non-convexity parameter $l$. Indeed, for $l = \delta \le 0.02$, the convergence of SVRG is as fast as the convex case (i.e., the $\delta = 0$ case). This confirms our theoretical finding in Section 5 and particularly confirms the existence of a threshold $O(\sigma)$ where the performance of SVRG only starts to degrade when $l$ is above this threshold.

• In Figure 2(b), we see that the performance of SVRG is approximately linearly proportional to $lL = O(\kappa)$, as compared to $l^2 = O(\kappa^2)$ from prior work. This confirms our finding in Section 6.

Appendix 


A Convergence Analysis for Section 4 

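For each outer iteration $s \in [S]$ and inner iteration $t \in \{0, 1, \ldots, m_s - 1\}$ of SVRG++, we denote by $i_t^s$ the selected random index $i \in [n]$ and $\xi_t^s$ the stochastic gradient $\xi = \nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(\tilde{x}^{s-1}) + \tilde{\mu}_{s-1}$. Then, using the convexity and smoothness of our objective, as well as the definition of our stochastic gradient step, we obtain the following lemma:

Lemma A.1. For every $u \in \mathbb{R}^d$ and $t \in \{0, 1, \ldots, m_s - 1\}$, fixing $x_t^s$ and letting $i = i_t^s$ be the random variable, we have

$$\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(u)\big] \le \mathbb{E}_{i_t^s}\Big[\frac{\eta}{2(1 - \eta L)}\|\xi_t^s - \nabla f(x_t^s)\|^2 + \frac{\|x_t^s - u\|^2 - \|x_{t+1}^s - u\|^2}{2\eta}\Big].$$

Proof. We first upper bound the left hand side:

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(u)\big] &= \mathbb{E}_{i_t^s}\big[f(x_{t+1}^s) - f(u) + \Psi(x_{t+1}^s) - \Psi(u)\big] \\
&\overset{①}{\le} \mathbb{E}_{i_t^s}\big[f(x_t^s) + \langle \nabla f(x_t^s), x_{t+1}^s - x_t^s \rangle + \tfrac{L}{2}\|x_t^s - x_{t+1}^s\|^2 - f(u) + \Psi(x_{t+1}^s) - \Psi(u)\big] \\
&\overset{②}{\le} \mathbb{E}_{i_t^s}\big[\langle \nabla f(x_t^s), x_t^s - u \rangle + \langle \nabla f(x_t^s), x_{t+1}^s - x_t^s \rangle + \tfrac{L}{2}\|x_t^s - x_{t+1}^s\|^2 + \Psi(x_{t+1}^s) - \Psi(u)\big] \\
&\overset{③}{=} \mathbb{E}_{i_t^s}\big[\langle \xi_t^s, x_t^s - u \rangle + \langle \nabla f(x_t^s), x_{t+1}^s - x_t^s \rangle + \tfrac{L}{2}\|x_t^s - x_{t+1}^s\|^2 + \Psi(x_{t+1}^s) - \Psi(u)\big]. \qquad \text{(A.1)}
\end{aligned}$$

Above, inequalities ① and ② are respectively due to the smoothness and convexity of $f(\cdot)$, and ③ is because $\mathbb{E}_{i_t^s}[\xi_t^s] = \nabla f(x_t^s)$. Next, using the definition of $x_{t+1}^s$ we have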


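$$\begin{aligned}
\langle \xi_t^s, x_t^s - u \rangle + \Psi(x_{t+1}^s) - \Psi(u) &= \langle \xi_t^s, x_t^s - x_{t+1}^s \rangle + \langle \xi_t^s, x_{t+1}^s - u \rangle + \Psi(x_{t+1}^s) - \Psi(u) \\
&\overset{④}{\le} \langle \xi_t^s, x_t^s - x_{t+1}^s \rangle + \big\langle -\tfrac{1}{\eta}(x_{t+1}^s - x_t^s), x_{t+1}^s - u \big\rangle \\
&\overset{⑤}{=} \langle \xi_t^s, x_t^s - x_{t+1}^s \rangle + \frac{\|x_t^s - u\|^2}{2\eta} - \frac{\|x_{t+1}^s - u\|^2}{2\eta} - \frac{\|x_{t+1}^s - x_t^s\|^2}{2\eta}.
\end{aligned}$$

Above, inequality ④ holds for the following reason. Recall that the minimality of $x_{t+1}^s = \arg\min_{y \in \mathbb{R}^d}\{\frac{1}{2\eta}\|y - x_t^s\|^2 + \Psi(y) + \langle \xi_t^s, y \rangle\}$ implies the existence of some subgradient $g \in \partial\Psi(x_{t+1}^s)$ which satisfies $\frac{1}{\eta}(x_{t+1}^s - x_t^s) + \xi_t^s + g = 0$. Combining this with $\Psi(u) - \Psi(x_{t+1}^s) \ge \langle g, u - x_{t+1}^s \rangle$, which is due to the convexity of $\Psi(\cdot)$, we immediately have $\Psi(u) - \Psi(x_{t+1}^s) + \langle \frac{1}{\eta}(x_{t+1}^s - x_t^s) + \xi_t^s, u - x_{t+1}^s \rangle \ge \langle \frac{1}{\eta}(x_{t+1}^s - x_t^s) + \xi_t^s + g, u - x_{t+1}^s \rangle = 0$. This gives inequality ④. In addition, ⑤ can be verified by expanding the Euclidean norms.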
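For a concrete instance of the proximal step used in this argument, consider the Lasso-type regularizer $\Psi(y) = \lambda\|y\|_1$ (the symbol $\lambda$ and the function name below are chosen here for illustration). In that case the minimizer has the well-known closed form of soft-thresholding applied to $x_t^s - \eta\xi_t^s$, which the following Python sketch implements.

```python
def prox_step_l1(x, xi, eta, lam):
    """argmin_y { ||y - x||^2 / (2*eta) + lam * ||y||_1 + <xi, y> }, coordinate-wise."""
    shrink = eta * lam
    out = []
    for xj, gj in zip(x, xi):
        v = xj - eta * gj                      # plain gradient step
        # soft-thresholding: sign(v) * max(|v| - shrink, 0)
        out.append(max(v - shrink, 0.0) + min(v + shrink, 0.0))
    return out


def step_objective(y, x, xi, eta, lam):
    """The objective minimized by the proximal step, for sanity checks."""
    quad = sum((yj - xj) ** 2 for yj, xj in zip(y, x)) / (2.0 * eta)
    return quad + lam * sum(abs(yj) for yj in y) + sum(gj * yj for gj, yj in zip(xi, y))


x_t = [1.0, -0.2, 0.05]
xi_t = [0.5, -0.5, 0.0]
x_next = prox_step_l1(x_t, xi_t, eta=0.2, lam=1.0)
```

Coordinates whose gradient step lands within $\eta\lambda$ of zero are set exactly to zero, which is why this step produces sparse iterates for Lasso.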

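Combining the above two inequalities, we have

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(u)\big] &\le \mathbb{E}_{i_t^s}\Big[\langle \xi_t^s - \nabla f(x_t^s), x_t^s - x_{t+1}^s \rangle - \frac{1 - \eta L}{2\eta}\|x_t^s - x_{t+1}^s\|^2 + \frac{\|x_t^s - u\|^2 - \|x_{t+1}^s - u\|^2}{2\eta}\Big] \\
&\overset{⑥}{\le} \mathbb{E}_{i_t^s}\Big[\frac{\eta}{2(1 - \eta L)}\|\xi_t^s - \nabla f(x_t^s)\|^2 + \frac{\|x_t^s - u\|^2 - \|x_{t+1}^s - u\|^2}{2\eta}\Big].
\end{aligned}$$

Above, ⑥ is by Young's inequality. □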
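For readers who want the Young's inequality step in the proof above spelled out, here is a short derivation. The shorthand $a = \xi_t^s - \nabla f(x_t^s)$, $b = x_t^s - x_{t+1}^s$ and $c = (1 - \eta L)/\eta$ is introduced here for illustration only, and the step assumes $\eta L < 1$ so that $c > 0$.

```latex
\begin{aligned}
\langle a, b \rangle - \frac{c}{2}\|b\|^2
  &\le \|a\|\,\|b\| - \frac{c}{2}\|b\|^2
  && \text{(Cauchy--Schwarz)} \\
  &= \frac{1}{2c}\|a\|^2 - \frac{1}{2c}\big(\|a\| - c\|b\|\big)^2 \\
  &\le \frac{1}{2c}\|a\|^2
   = \frac{\eta}{2(1 - \eta L)}\,\|a\|^2 .
\end{aligned}
```

The coefficient $\frac{\eta}{2(1-\eta L)}$ in Lemma A.1 is therefore exactly $\frac{1}{2c}$ for this choice of $c$.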

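The next lemma is classical and analogous to most of the variance reduction literature (cf. [5, 13, 29]). We include it here for the sake of completeness.

Lemma A.2. $\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] \le 4L \cdot \big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big)$.

Proof. The proof of this lemma is classical and is analogous to most of the variance reduction literature (cf. [5, 13, 29]). Indeed,

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] &= \mathbb{E}_{i_t^s}\big[\|(\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(\tilde{x}^{s-1})) - (\nabla f(x_t^s) - \nabla f(\tilde{x}^{s-1}))\|^2\big] \\
&\overset{①}{\le} \mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(\tilde{x}^{s-1})\|^2\big] \\
&= \mathbb{E}_{i_t^s}\big[\|(\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(x^*)) - (\nabla f_{i_t^s}(\tilde{x}^{s-1}) - \nabla f_{i_t^s}(x^*))\|^2\big] \\
&\overset{②}{\le} 2 \cdot \mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(x^*)\|^2 + \|\nabla f_{i_t^s}(\tilde{x}^{s-1}) - \nabla f_{i_t^s}(x^*)\|^2\big].
\end{aligned}$$

Above, ① is because for any random vector $\zeta \in \mathbb{R}^d$, it holds that $\mathbb{E}\|\zeta - \mathbb{E}\zeta\|^2 = \mathbb{E}\|\zeta\|^2 - \|\mathbb{E}\zeta\|^2$, and ② is because for any two vectors $a, b \in \mathbb{R}^d$, it holds that $\|a - b\|^2 \le 2\|a\|^2 + 2\|b\|^2$.

Next, the classical smoothness assumption on a function $f_i$ yields (see for instance Theorem 2.1.5 in the textbook [19]) $\|\nabla f_i(x) - \nabla f_i(x^*)\|^2 \le 2L\big(f_i(x) - f_i(x^*) - \langle \nabla f_i(x^*), x - x^* \rangle\big)$. Plugging this into the above inequality, we have

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big]
&\le 4L \cdot \mathbb{E}_{i_t^s}\big[f_{i_t^s}(x_t^s) - f_{i_t^s}(x^*) - \langle \nabla f_{i_t^s}(x^*), x_t^s - x^* \rangle + f_{i_t^s}(\tilde{x}^{s-1}) - f_{i_t^s}(x^*) - \langle \nabla f_{i_t^s}(x^*), \tilde{x}^{s-1} - x^* \rangle\big] \\
&= 4L \cdot \big(f(x_t^s) - f(x^*) - \langle \nabla f(x^*), x_t^s - x^* \rangle + f(\tilde{x}^{s-1}) - f(x^*) - \langle \nabla f(x^*), \tilde{x}^{s-1} - x^* \rangle\big) \\
&= 4L \cdot \big(f(x_t^s) - f(x^*) + \langle g^*, x_t^s - x^* \rangle + f(\tilde{x}^{s-1}) - f(x^*) + \langle g^*, \tilde{x}^{s-1} - x^* \rangle\big) \\
&\le 4L \cdot \big(f(x_t^s) - f(x^*) + \Psi(x_t^s) - \Psi(x^*) + f(\tilde{x}^{s-1}) - f(x^*) + \Psi(\tilde{x}^{s-1}) - \Psi(x^*)\big) \\
&= 4L \cdot \big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big).
\end{aligned}$$

Above, $g^* \in \partial\Psi(x^*)$ is the subgradient of $\Psi$ at $x^*$ that satisfies $\nabla f(x^*) + g^* = 0$. □

We are now ready to prove the main theorem for the convergence of SVRG++:

Proof of Theorem 4.1. Combining Lemma A.1 with $u = x^*$ and Lemma A.2, we have

$$\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(x^*)\big] \le \frac{2\eta L}{1 - \eta L}\big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big) + \frac{\|x_t^s - x^*\|^2 - \mathbb{E}_{i_t^s}\|x_{t+1}^s - x^*\|^2}{2\eta}.$$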

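Choosing $\eta = 1/(7L)$ in the above inequality, so that $\frac{2\eta L}{1 - \eta L} = \frac{1}{3}$, summing it up over $t = 0, 1, \ldots, m_s - 1$, and dividing both sides by $m_s$, we arrive at

$$\mathbb{E}\Big[\sum_{t=0}^{m_s-1}\frac{F(x_{t+1}^s)}{m_s} - F(x^*)\Big] \le \mathbb{E}\Big[\frac{1}{3}\Big(\sum_{t=0}^{m_s-1}\frac{F(x_t^s)}{m_s} - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\Big) + \frac{\|x_0^s - x^*\|^2 - \|x_{m_s}^s - x^*\|^2}{2\eta \cdot m_s}\Big].$$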

After rearranging, this yields

$$2\,\mathbb{E}\Big[\sum_{t=0}^{m_s-1}\frac{F(x_{t+1}^s)}{m_s} - F(x^*)\Big] \le \mathbb{E}\Big[\frac{(F(x_0^s) - F(x^*)) - (F(x_{m_s}^s) - F(x^*))}{m_s} + F(\tilde{x}^{s-1}) - F(x^*) + \frac{\|x_0^s - x^*\|^2 - \|x_{m_s}^s - x^*\|^2}{2\eta/3 \cdot m_s}\Big]. \tag{A.2}$$

Next, using the fact that $F(\tilde{x}^s) \le \sum_{t=0}^{m_s-1}\frac{F(x_{t+1}^s)}{m_s}$ due to the convexity of $F$ and the definition $\tilde{x}^s = \frac{1}{m_s}\sum_{t=1}^{m_s} x_t^s$, as well as the choice $x_{m_s}^s = x_0^{s+1}$, we rewrite the above inequality as

$$2\,\mathbb{E}\big[F(\tilde{x}^s) - F(x^*)\big] \le \mathbb{E}\Big[\frac{(F(x_0^s) - F(x^*)) - (F(x_0^{s+1}) - F(x^*))}{m_s} + F(\tilde{x}^{s-1}) - F(x^*) + \frac{\|x_0^s - x^*\|^2 - \|x_0^{s+1} - x^*\|^2}{2\eta/3 \cdot m_s}\Big]. \tag{A.3}$$

After rearranging and using the fact $m_s = 2m_{s-1}$, we conclude that

$$2\,\mathbb{E}\Big[F(\tilde{x}^s) - F(x^*) + \frac{\|x_0^{s+1} - x^*\|^2}{4\eta/3 \cdot m_s} + \frac{F(x_0^{s+1}) - F(x^*)}{2m_s}\Big] \le \mathbb{E}\Big[F(\tilde{x}^{s-1}) - F(x^*) + \frac{\|x_0^s - x^*\|^2}{4\eta/3 \cdot m_{s-1}} + \frac{F(x_0^s) - F(x^*)}{2m_{s-1}}\Big].$$

In sum, after telescoping for $s = 1, 2, \ldots, S$, we have$^{11}$

$$\begin{aligned}
\mathbb{E}\big[F(\tilde{x}^S) - F(x^*)\big] &\le 2^{-S} \cdot \Big(F(\tilde{x}^0) - F(x^*) + \frac{\|x_0^1 - x^*\|^2}{4\eta/3 \cdot m_0} + \frac{F(x_0^1) - F(x^*)}{2m_0}\Big) \\
&\le \frac{F(x^\phi) - F(x^*)}{2^{S-1}} + \frac{\|x^\phi - x^*\|^2}{2^S \cdot 4\eta/3 \cdot m_0}.
\end{aligned}$$

This finishes the proof of (4.1) due to the choice of $\eta = 1/(7L)$. Finally, SVRG++ computes $S$ times the full gradient $\nabla f(\cdot)$, and $\sum_{s=1}^S m_s = O(2^S m_0)$ times the gradient $\nabla f_i(\cdot)$. This gives a total gradient complexity $O(S \cdot n + 2^S \cdot m_0)$. □

$^{11}$ We can perform telescoping because we set our starting vector $x_0^{s+1}$ of each epoch to equal the ending vector $x_{m_s}^s$ of the previous epoch. This is different from SVRG, which chooses the average of the previous epoch as the starting vector. This difference is also beneficial in practice (see Section 8).

B Convergence Analysis for Section 5


As in Section 4, for each outer iteration $s \in [S]$ and inner iteration $t \in \{0, 1, \ldots, m - 1\}$ of SVRG, we denote by $i_t^s$ the selected random index $i \in [n]$ and $\xi_t^s$ the stochastic gradient $\xi = \nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(\tilde{x}^{s-1}) + \tilde{\mu}_{s-1}$. Then, the following lemma is a counterpart of Lemma A.1 where the only difference is the use of the strong convexity parameter $\sigma$:

Lemma B.1. For every $u \in \mathbb{R}^d$ and $t \in \{0, 1, \ldots, m - 1\}$, fixing $x_t^s$ and letting $i = i_t^s$ be the random variable, we have

$$\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(u)\big] \le \mathbb{E}_{i_t^s}\Big[\frac{\eta}{2(1 - \eta L)}\|\xi_t^s - \nabla f(x_t^s)\|^2 + \frac{(1 - \sigma\eta)\|x_t^s - u\|^2 - \|x_{t+1}^s - u\|^2}{2\eta}\Big].$$
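Proof. We first upper bound the left hand side using the strong convexity and smoothness of $f(\cdot)$:

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(u)\big] &= \mathbb{E}_{i_t^s}\big[f(x_{t+1}^s) - f(u) + \Psi(x_{t+1}^s) - \Psi(u)\big] \\
&\le \mathbb{E}_{i_t^s}\big[f(x_t^s) + \langle \nabla f(x_t^s), x_{t+1}^s - x_t^s \rangle + \tfrac{L}{2}\|x_t^s - x_{t+1}^s\|^2 - f(u) + \Psi(x_{t+1}^s) - \Psi(u)\big] \\
&\le \mathbb{E}_{i_t^s}\big[\langle \nabla f(x_t^s), x_t^s - u \rangle - \tfrac{\sigma}{2}\|x_t^s - u\|^2 + \langle \nabla f(x_t^s), x_{t+1}^s - x_t^s \rangle + \tfrac{L}{2}\|x_t^s - x_{t+1}^s\|^2 + \Psi(x_{t+1}^s) - \Psi(u)\big] \\
&= \mathbb{E}_{i_t^s}\big[\langle \xi_t^s, x_t^s - u \rangle - \tfrac{\sigma}{2}\|x_t^s - u\|^2 + \langle \nabla f(x_t^s), x_{t+1}^s - x_t^s \rangle + \tfrac{L}{2}\|x_t^s - x_{t+1}^s\|^2 + \Psi(x_{t+1}^s) - \Psi(u)\big]. \qquad \text{(B.1)}
\end{aligned}$$

Above, the term $-\frac{\sigma}{2}\|x_t^s - u\|^2$ is due to the $\sigma$-strong convexity of $f(\cdot)$, and this is the only difference between the inequalities (B.1) and (A.1). Therefore, Lemma B.1 can be proven using exactly the same remaining steps as in the proof of Lemma A.1. □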


We next state and prove a counterpart of Lemma A.2.

Lemma 5.2.

$$\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] \le 4(L + l) \cdot \big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big) + (8l^2 + 4Ll)\big(\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2\big).$$

Before we prove this lemma let us make a few remarks. First, if $l = 0$ then Lemma 5.2 is identical to Lemma A.2. In general, the second term in the above upper bound has a factor $8l^2 + 4Ll$ in the front which increases as $l$ increases. We can also compare Lemma 5.2 to that obtained by Shalev-Shwartz for sum-of-non-convex objectives: he showed $\|\xi_t^s - \nabla f(x_t^s)\|^2 \le O(L^2) \cdot (\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2)$ in [22], which is suboptimal compared to ours and exactly why the $L^2$ factor shows up in his final gradient complexity.


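Proof of Lemma 5.2. The first step of the proof of this lemma is analogous to most of the variance reduction literature (cf. [5, 13, 29]):

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] &= \mathbb{E}_{i_t^s}\big[\|(\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(\tilde{x}^{s-1})) - (\nabla f(x_t^s) - \nabla f(\tilde{x}^{s-1}))\|^2\big] \\
&\overset{①}{\le} \mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(\tilde{x}^{s-1})\|^2\big] \\
&= \mathbb{E}_{i_t^s}\big[\|(\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(x^*)) - (\nabla f_{i_t^s}(\tilde{x}^{s-1}) - \nabla f_{i_t^s}(x^*))\|^2\big] \\
&\overset{②}{\le} 2 \cdot \mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(x^*)\|^2 + \|\nabla f_{i_t^s}(\tilde{x}^{s-1}) - \nabla f_{i_t^s}(x^*)\|^2\big]. \qquad \text{(B.2)}
\end{aligned}$$

Above, ① is because for any random vector $\zeta \in \mathbb{R}^d$, it holds that $\mathbb{E}\|\zeta - \mathbb{E}\zeta\|^2 = \mathbb{E}\|\zeta\|^2 - \|\mathbb{E}\zeta\|^2$, and ② is because for any two vectors $a, b \in \mathbb{R}^d$, it holds that $\|a - b\|^2 \le 2\|a\|^2 + 2\|b\|^2$.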

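For analysis purposes only, we define $\phi_i(y) \stackrel{\mathrm{def}}{=} f_i(y) - \langle \nabla f_i(x^*), y \rangle + \frac{l}{2}\|y - x^*\|^2$ for each $i \in [n]$. It is clear that $\phi_i(y)$ is a convex, $(L + l)$-smooth function that has a minimizer $y = x^*$ (which can be seen by taking the derivative). For this reason, we claim that

$$\phi_i(x^*) \le \phi_i(y) - \frac{1}{2(L + l)}\|\nabla\phi_i(y)\|^2, \tag{B.3}$$

for each $y$, and this inequality is classical for smooth functions (see for instance Theorem 2.1.5 in the textbook [19]). By expanding out the definition of $\phi_i(\cdot)$ in (B.3), we immediately have

$$f_i(x^*) - \langle \nabla f_i(x^*), x^* \rangle \le f_i(y) - \langle \nabla f_i(x^*), y \rangle + \frac{l}{2}\|y - x^*\|^2 - \frac{1}{2(L + l)}\|\nabla f_i(y) - \nabla f_i(x^*) + l(y - x^*)\|^2$$

which then implies

$$\begin{aligned}
\|\nabla f_i(y) - \nabla f_i(x^*)\|^2 &\le 2\|\nabla f_i(y) - \nabla f_i(x^*) + l(y - x^*)\|^2 + 2\|l(y - x^*)\|^2 \\
&\le 2(L + l)\big(f_i(y) - f_i(x^*) - \langle \nabla f_i(x^*), y - x^* \rangle\big) + (4l^2 + 2Ll)\|y - x^*\|^2. \qquad \text{(B.4)}
\end{aligned}$$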

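Now, by choosing $y = x_t^s$ and $i = i_t^s$ in (B.4), we have

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(x^*)\|^2\big]
&\le \mathbb{E}_{i_t^s}\big[2(L + l)\big(f_{i_t^s}(x_t^s) - f_{i_t^s}(x^*) - \langle \nabla f_{i_t^s}(x^*), x_t^s - x^* \rangle\big)\big] + (4l^2 + 2Ll)\|x_t^s - x^*\|^2 \\
&= 2(L + l)\big(f(x_t^s) - f(x^*) + \langle g^*, x_t^s - x^* \rangle\big) + (4l^2 + 2Ll)\|x_t^s - x^*\|^2 \\
&\le 2(L + l)\big(f(x_t^s) - f(x^*) + \Psi(x_t^s) - \Psi(x^*)\big) + (4l^2 + 2Ll)\|x_t^s - x^*\|^2 \\
&= 2(L + l)\big(F(x_t^s) - F(x^*)\big) + (4l^2 + 2Ll)\|x_t^s - x^*\|^2. \qquad \text{(B.5)}
\end{aligned}$$

Above, $g^* \in \partial\Psi(x^*)$ is the subgradient of $\Psi$ at $x^*$ that satisfies $\nabla f(x^*) + g^* = 0$.

Similarly, by choosing $y = \tilde{x}^{s-1}$ and $i = i_t^s$ in (B.4), we have

$$\mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(\tilde{x}^{s-1}) - \nabla f_{i_t^s}(x^*)\|^2\big] \le 2(L + l)\big(F(\tilde{x}^{s-1}) - F(x^*)\big) + (4l^2 + 2Ll)\|\tilde{x}^{s-1} - x^*\|^2. \tag{B.6}$$

Finally, putting together (B.2), (B.5) and (B.6) we finish the proof of the desired lemma. □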

Finally, we are ready to prove our main theorem of this section: 

Proof of Theorem 5.1. Combining Lemma B.1 with $u = x^*$, Lemma 5.2, as well as the assumption that $l \le L$, we have

Ei t . [F(x s t+l ) - F(x*)] < ~ F(x*) + F(x s ~ l ) - F(x*) + |||x f s - x*|| 2 + ^Wx'- 1 - x* 

(1 — ag)\\x s t — x*\\ 2 — Ej|||xf +1 — x*|| 2 
+ 2 g 
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Choosing rj = min{ Ar , g^rr} in the above inequality, we conclude that 


% [F(x s t+1 ) - F(x*)] < ± (F(xf) - F(x*) + F(x s ~ l ) - F(x*)) + ^Hx*" 1 - x*" 2 


+ 


|x£ — x*\\ 2 — ||x£ +1 — x 

2 7] 


Summing it up over t = 0,1,..., m — 1, and dividing both sides by m, we arrive at 


m— 1 

e ie 

t =o 


F(x : 


t+1^ 

m 


-F(x*)]<E - ^ 


-I Ffl 1 y—i / o 

1 / V- F(x a t 

m 


-F(x*)+F(x s_1 )-F(x*) + 


t =o 


I _ —*112 

I Jb n Jb 


o 


jrs-1 —* 112 


—I-x a_ -x’ 

2r] ■ m 10 


After rearranging we have 

m —1 


4E[ F ^ +1 ^ - F(x*)J < IE 


t =0 


m 


(F(x5)-F(x*))-(F(x^)-F(x*)) 


+ F(F _1 ) -F(x*) 


+ 


m 


I I I 2 --j- 

l X 0 X II , _ ™*||2 

2r7/5 -m2 11 11 


< (1 + -) (F(x s_1 ) - F(x*)) + (— + 1) (F(x s_1 ) - F(x*)) . 

\ \(jrjm 

Above, the last inequality uses the fact that $x^*$ is a minimizer of $F(\cdot)$ as well as our choice $x_0^s = \tilde{x}^{s-1}$. Using the convexity of $F(\cdot)$ we have $F(\tilde{x}^s) \le \frac{1}{m}\sum_{t=1}^m F(x_t^s)$ and therefore the above inequality gives


E[F(x s ) - F(x*)} < 2 + m + CT??m (F(F _1 ) - F(x*)) . 


□ 


C Convergence Analysis for Section 6 

This section is devoted to proving Theorem 6.1. We use the same notation as in Section 5 and Lemma B.1 remains true here. We replace Lemma 5.2 with the following:

Lemma C.1.

$$\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] \le (8L^2 + 4Ll)\big(\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2\big).$$

Proof. We begin the proof by first recalling (B.2) from the proof of Lemma 5.2:

$$\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] \le 2 \cdot \mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(x^*)\|^2 + \|\nabla f_{i_t^s}(\tilde{x}^{s-1}) - \nabla f_{i_t^s}(x^*)\|^2\big]. \tag{B.2}$$

This time, we define $\phi_i(y) \stackrel{\mathrm{def}}{=} -f_i(y) + \langle \nabla f_i(x^*), y \rangle + \frac{L}{2}\|y - x^*\|^2$ for each $i \in [n]$. It is clear that $\phi_i(y)$ is a convex, $(L + l)$-smooth function that has a minimizer $y = x^*$ (which can be seen by taking the derivative). For this reason, we claim that

$$\phi_i(x^*) \le \phi_i(y) - \frac{1}{2(L + l)}\|\nabla\phi_i(y)\|^2, \tag{C.1}$$

for each $y$, and this inequality is classical for smooth functions (see for instance Theorem 2.1.5 in the textbook [19]). By expanding out the definition of $\phi_i(\cdot)$ in (C.1), we immediately have

$$-f_i(x^*) + \langle \nabla f_i(x^*), x^* \rangle \le -f_i(y) + \langle \nabla f_i(x^*), y \rangle + \frac{L}{2}\|y - x^*\|^2 - \frac{1}{2(L + l)}\|\nabla f_i(y) - \nabla f_i(x^*) - L(y - x^*)\|^2$$

which then implies that

$$\begin{aligned}
\|\nabla f_i(y) - \nabla f_i(x^*)\|^2 &\le 2\|\nabla f_i(y) - \nabla f_i(x^*) - L(y - x^*)\|^2 + 2\|L(y - x^*)\|^2 \\
&\le 2(L + l)\big(f_i(x^*) - f_i(y) + \langle \nabla f_i(x^*), y - x^* \rangle\big) + (4L^2 + 2Ll)\|y - x^*\|^2. \qquad \text{(C.2)}
\end{aligned}$$

Now by choosing $y = x_t^s$ and $i = i_t^s$ in (C.2), we have

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(x_t^s) - \nabla f_{i_t^s}(x^*)\|^2\big]
&\le \mathbb{E}_{i_t^s}\big[2(L + l)\big(f_{i_t^s}(x^*) - f_{i_t^s}(x_t^s) + \langle \nabla f_{i_t^s}(x^*), x_t^s - x^* \rangle\big)\big] + (4L^2 + 2Ll)\|x_t^s - x^*\|^2 \\
&= 2(L + l)\big(f(x^*) - f(x_t^s) + \langle \nabla f(x^*), x_t^s - x^* \rangle\big) + (4L^2 + 2Ll)\|x_t^s - x^*\|^2 \\
&\le (4L^2 + 2Ll)\|x_t^s - x^*\|^2. \qquad \text{(C.3)}
\end{aligned}$$

Above, the second inequality uses the convexity of $f(\cdot)$. Similarly, by choosing $y = \tilde{x}^{s-1}$ and $i = i_t^s$ in (C.2), we have

$$\mathbb{E}_{i_t^s}\big[\|\nabla f_{i_t^s}(\tilde{x}^{s-1}) - \nabla f_{i_t^s}(x^*)\|^2\big] \le (4L^2 + 2Ll)\|\tilde{x}^{s-1} - x^*\|^2. \tag{C.4}$$

Finally, putting together (B.2), (C.3) and (C.4) we finish the proof of the desired lemma. □

Finally, we are ready to prove our main theorem of this section: 

Proof of Theorem 6.1. Combining Lemma B.1 with $u = x^*$, Lemma C.1, as well as the assumption that $L \le l$, we have


E.: ; F« +1 )-C(x*)] < - x *\\ 2 + - x'\\ 2 ) 


+ 


(1 ~ ov)\\x s t - x* || 2 - Ej ? ||xg +1 - 

2 rj 


Choosing r/ = ™ the a bove inequality, we obtain that 

E*? [FfrU i) - F(x*)] < JlF " 1 - F || 2 + lkf ~ Xl|2 ~^ fl|Xf+1 ~ — 

Summing it up over t = 0,1,..., m — 1, and dividing both sides by m, we arrive at 


m— 1 


E[^ F ^ +l) -F(x*)] <E n|3:n X 


t =0 


m 


* 112 


O , 


XS —1 * ||2 


2r/ ■ m 


+ T r - X* 

4 1 


Finally, using our choice $x_0^s = \tilde{x}^{s-1}$, using the convexity of $F(\cdot)$ which tells us $F(\tilde{x}^s) \le \frac{1}{m}\sum_{t=1}^m F(x_t^s)$, and using the strong convexity of $F(\cdot)$ which tells us $\frac{\sigma}{2}\|\tilde{x}^{s-1} - x^*\|^2 \le F(\tilde{x}^{s-1}) - F(x^*)$, we conclude from the above inequality that


2 H—— 

E[F(F) - F(x*)j < --pCt(F(x s_1 ) - F(x*)) 


□

D Convergence Analysis for Section 7


We use the same notations of $i_t^s$ and $\xi_t^s$ as in previous sections. The following lemma is exactly Lemma A.1 where the step length $\eta$ is replaced with $\eta_{t+1}^s$:

Lemma D.1 (Lemma A.1 revised). For every $u \in \mathbb{R}^d$ and $t \in \{0, 1, \ldots, m_s - 1\}$, fixing $x_t^s$ and letting $i = i_t^s$ be the random variable, we have

$$\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(u)\big] \le \mathbb{E}_{i_t^s}\Big[\frac{\eta_{t+1}^s}{2(1 - \eta_{t+1}^s L)}\|\xi_t^s - \nabla f(x_t^s)\|^2 + \frac{\|x_t^s - u\|^2 - \|x_{t+1}^s - u\|^2}{2\eta_{t+1}^s}\Big].$$

Also, by combining Lemma 5.2 (for $l \le L$) and Lemma C.1 (for $l \ge L$), we have that for every $l \ge 0$,

Lemma D.2.

$$\mathbb{E}_{i_t^s}\big[\|\xi_t^s - \nabla f(x_t^s)\|^2\big] \le 8L \cdot \big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big) + 12Ll\big(\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2\big).$$
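The combination behind Lemma D.2 can be spelled out as a short case analysis. This is a sketch that uses only the bounds stated in Lemma 5.2 and Lemma C.1; the shorthand $R$ and $Q$ is introduced here for illustration.

```latex
\text{Let } R = F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*) \ge 0,
\qquad Q = \|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2 \ge 0 .

\text{Case } l \le L \text{ (Lemma 5.2):}\quad
4(L + l)\,R + (8l^2 + 4Ll)\,Q \;\le\; 8L\,R + 12Ll\,Q ,
\quad\text{since } L + l \le 2L \text{ and } 8l^2 \le 8Ll .

\text{Case } l \ge L \text{ (Lemma C.1):}\quad
(8L^2 + 4Ll)\,Q \;\le\; 12Ll\,Q \;\le\; 8L\,R + 12Ll\,Q ,
\quad\text{since } 8L^2 \le 8Ll \text{ and } R \ge 0 .
```

In both regimes the right-hand side is the bound of Lemma D.2, so the lemma holds for every $l \ge 0$.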

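Now we are ready to prove a lemma that is different from all previous sections.

Lemma D.3. If $m_0 > 1$, $\eta \le 1/(13L)$, and $\frac{1}{4\sqrt{2}\,\eta T} \ge 39\eta L l$, we have

$$\mathbb{E}\big[F(\tilde{x}^S) - F(x^*)\big] \le \frac{F(x^\phi) - F(x^*)}{2^{S-1}} + \frac{39\eta L l\,\|x^\phi - x^*\|^2}{2^S} + \frac{\|x^\phi - x^*\|^2}{2^S \cdot \frac{4\eta m_0}{3\sqrt{2}}}. \tag{D.1}$$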

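Proof. Combining Lemma D.1 with $u = x^*$ and Lemma D.2, as well as using the fact that $\eta_{t+1}^s \le \eta$, we have

$$\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(x^*)\big] \le \frac{4\eta L}{1 - \eta L}\Big[F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*) + 3l\|x_t^s - x^*\|^2 + 3l\|\tilde{x}^{s-1} - x^*\|^2\Big] + \frac{\|x_t^s - x^*\|^2 - \mathbb{E}_{i_t^s}\|x_{t+1}^s - x^*\|^2}{2\eta_{t+1}^s}.$$

Choosing $\eta \le 1/(13L)$ in the above inequality, we have

$$\begin{aligned}
\mathbb{E}_{i_t^s}\big[F(x_{t+1}^s) - F(x^*)\big] &\le \frac{1}{3}\big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big) + 13\eta L l\big(\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2\big) + \frac{\|x_t^s - x^*\|^2 - \mathbb{E}_{i_t^s}\|x_{t+1}^s - x^*\|^2}{2\eta_{t+1}^s} \\
&\le \frac{1}{3}\big(F(x_t^s) - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\big) + 13\eta L l\big({-2}\|x_t^s - x^*\|^2 + \|\tilde{x}^{s-1} - x^*\|^2\big) + \frac{\|x_t^s - x^*\|^2}{2\eta_t^s} - \frac{\mathbb{E}_{i_t^s}\|x_{t+1}^s - x^*\|^2}{2\eta_{t+1}^s},
\end{aligned}$$

where the last inequality uses (7.1) and the assumption that $\frac{1}{4\sqrt{2}\,\eta T} \ge 39\eta L l$.

Summing it up over $t = 0, 1, \ldots, m_s - 1$ and dividing both sides by $m_s$, we arrive at

$$\mathbb{E}\Big[\sum_{t=0}^{m_s-1}\frac{F(x_{t+1}^s) + 26\eta L l\|x_t^s - x^*\|^2}{m_s} - F(x^*)\Big] \le \mathbb{E}\Big[\frac{1}{3}\Big(\sum_{t=0}^{m_s-1}\frac{F(x_t^s)}{m_s} - F(x^*) + F(\tilde{x}^{s-1}) - F(x^*)\Big) + 13\eta L l\,\|\tilde{x}^{s-1} - x^*\|^2 + \frac{\|x_0^s - x^*\|^2}{2\eta_0^s \cdot m_s} - \frac{\|x_{m_s}^s - x^*\|^2}{2\eta_{m_s}^s \cdot m_s}\Big].$$

After rearranging, this yields

$$2\,\mathbb{E}\Big[\sum_{t=0}^{m_s-1}\frac{F(x_t^s) + 39\eta L l\|x_t^s - x^*\|^2}{m_s} - F(x^*)\Big] \le \mathbb{E}\Big[\frac{3(F(x_0^s) - F(x^*)) - 3(F(x_{m_s}^s) - F(x^*))}{m_s} + F(\tilde{x}^{s-1}) - F(x^*) + 39\eta L l\,\|\tilde{x}^{s-1} - x^*\|^2 + \frac{\|x_0^s - x^*\|^2}{2\eta_0^s/3 \cdot m_s} - \frac{\|x_{m_s}^s - x^*\|^2}{2\eta_{m_s}^s/3 \cdot m_s}\Big].$$

Next, using the fact that $F(\tilde{x}^s) \le \frac{1}{m_s}\sum_{t=0}^{m_s-1} F(x_t^s)$ and $\|\tilde{x}^s - x^*\|^2 \le \frac{1}{m_s}\sum_{t=0}^{m_s-1}\|x_t^s - x^*\|^2$, which follow from convexity and the definition $\tilde{x}^s = \frac{1}{m_s}\sum_{t=0}^{m_s-1} x_t^s$, we can rewrite the above inequality as

$$2\,\mathbb{E}\big[F(\tilde{x}^s) - F(x^*) + 39\eta L l\|\tilde{x}^s - x^*\|^2\big] \le \mathbb{E}\Big[\frac{3(F(x_0^s) - F(x^*)) - 3(F(x_{m_s}^s) - F(x^*))}{m_s} + F(\tilde{x}^{s-1}) - F(x^*) + 39\eta L l\,\|\tilde{x}^{s-1} - x^*\|^2 + \frac{\|x_0^s - x^*\|^2}{2\eta_0^s/3 \cdot m_s} - \frac{\|x_{m_s}^s - x^*\|^2}{2\eta_{m_s}^s/3 \cdot m_s}\Big].$$

At this point, let us recall the choices $x_{m_s}^s = x_0^{s+1}$, $\eta_{m_s}^s = \eta_0^{s+1}$, and $m_s = 2m_{s-1}$, which yield

$$2\,\mathbb{E}\Big[F(\tilde{x}^s) - F(x^*) + 39\eta L l\|\tilde{x}^s - x^*\|^2 + \frac{\|x_0^{s+1} - x^*\|^2}{4\eta_0^{s+1}/3 \cdot m_s} + \frac{F(x_0^{s+1}) - F(x^*)}{2m_s/3}\Big] \le \mathbb{E}\Big[F(\tilde{x}^{s-1}) - F(x^*) + 39\eta L l\|\tilde{x}^{s-1} - x^*\|^2 + \frac{\|x_0^s - x^*\|^2}{4\eta_0^s/3 \cdot m_{s-1}} + \frac{F(x_0^s) - F(x^*)}{2m_{s-1}/3}\Big].$$

In sum, after telescoping for $s = 1, 2, \ldots, S$, we have

$$\begin{aligned}
\mathbb{E}\big[F(\tilde{x}^S) - F(x^*)\big] &\le 2^{-S} \cdot \Big(F(\tilde{x}^0) - F(x^*) + 39\eta L l\|\tilde{x}^0 - x^*\|^2 + \frac{\|x_0^1 - x^*\|^2}{4\eta_0^1/3 \cdot m_0} + \frac{F(x_0^1) - F(x^*)}{2m_0/3}\Big) \\
&\le \frac{F(x^\phi) - F(x^*)}{2^{S-1}} + \frac{39\eta L l\,\|x^\phi - x^*\|^2}{2^S} + \frac{\|x^\phi - x^*\|^2}{2^S \cdot \frac{4\eta m_0}{3\sqrt{2}}}.
\end{aligned}$$

□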


Finally, the above lemma immediately yields our desired theorem: 

Proof of Theorem 7.1 Under the given parameter choices, we first have 

Ay/2Tri ~ Ay/2 V ■ 2 m 0 ■ 2 s ~ 8y/2r]mo • f ~~ 8\/20 ~~ 39 312y/20 “ 
so the assumption of Lemma D.3 holds.

Now we consider the three terms on the right hand side of (D.l). The first term is no more 
than < 2e. The second term is no more than 

39 r/LlQ 39 r/LlQ e e 

2 s A 8^2A 8y/2 

The third term is no more than 

0 _ 0 3\/2 

“ vUS “ 

In sum, we conclude that $\mathbb{E}[F(\tilde{x}^S) - F(x^*)] \le O(\varepsilon)$. □











































[Figure 3 plots: y-axis "training loss - optimum"; panels include (g) Adult, Lasso; (h) Adult, Logistic; (i) Adult, Ridge; (j) Adult, Lasso.]

Figure 3: Training error comparisons on dataset Adult, using Tuning Type I.

Figure 4: Training error comparisons on dataset Covtype, using Tuning Type I.

Figure 5: Training error comparisons on dataset IJCNN1, using Tuning Type I.

[Figure 6 plots: panels include (d) MNIST, Lasso.]

Figure 6: Training error comparisons on dataset MNIST, using Tuning Type I.





















































































































[Figure 7 plots: y-axis "training loss - optimum", x-axis "grad / n"; panels include (a) Adult, Lasso; (b) Adult, Logistic; (c) Adult, Ridge; (d) Adult, Lasso; (g) Adult, Lasso; curves for SDCA, SAGA, SVRG, SVRG-Auto-Epoch, and SVRG++, each labeled with its best-tuned parameter.]

Figure 7: Training error comparisons on dataset Adult, using Tuning Type II.

Figure 8: Training error comparisons on dataset Covtype, using Tuning Type II.

[Figure 9 plots: panels include (g) IJCNN1, Lasso.]

Figure 9: Training error comparisons on dataset IJCNN1, using Tuning Type II.

Figure 10: Training error comparisons on dataset MNIST, using Tuning Type II.